The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
This package contains a collection of various (low-level) tools which may be of general interest. These functions were accumulated over a number of years of data-wrangling when treating high-throughput data from biomedical applications. Besides, these functions are further used/integrated in more specialized functions dedicated to specific applications in the packages wrProteo, wrGraph or wrTopDownFrag. All these packages are available on CRAN.
If you are not familiar with R you may find many introductory documents on the official R-site in contributed documents or under Documentation/Manuals. Of course, numerous other documents/sites with tutorials and courses exist, too.
One of the aims was to write a package easy to install, with low
system requirements and few obligatory dependencies.
All code is written in pure R and does not need any special compilers.
The number of obligatory dependencies was kept to a minumum.
Most of additional packages used in some of the functions were declared as ‘suggested’ (ie not obligatory), to allow installation of wrMisc even if some these additional packages can’t be installed/compiled by the user’s instance. When a feature/function of one of the ‘suggested’ packages is about to be used, its presence/installation will be checked and, only if found as missing, the user will be prompted a message inviting to install specific package(s) before using these specific functions. This helps to avoid not being able installing this package at all if some dependencies may fail to get installed themselves.
To get started, we need to install (if not yet installed) and load the package “wrMisc” available from CRAN.
## If not already installed, you'll have to install the package first.
## This is the basic installation commande in R
install.packages("wrMisc")
Since the functions illustrated in this vignette require a number of the suggested packages, let’s check if they are installed and add them (via a small function), if not yet installed.
packages <- c("knitr", "rmarkdown", "BiocManager", "kableExtra", "boot", "data.tree", "data.table",
"fdrtool", "RColorBrewer", "Rcpp", "wrMisc", "wrGraph", "wrProteo")
checkInstallPkg <- function(pkg) { # install function
if(!requireNamespace(pkg, quietly=TRUE)) install.packages(pkg) }
## install if not yet present
sapply(packages, checkInstallPkg)
Finally, this package also uses the Bioconductor package limma which has to be installed differently (see also help on Bioconductor):
## Installation of limma
BiocManager::install("limma")
This vignette is also accessible from R command-line or on CRAN at wrMisc:
## Now you can open this vignette out of R:
vignette("wrMiscVignette1", package="wrMisc")
Before using the functions of this package, we actually need to load the package first (best on a fresh R-session):
library("wrMisc")
library("knitr")
## This is 'wrMisc' version number :
packageVersion("wrMisc")
## [1] '1.15.2'
In high-throughput experiments in biology (like transcriptomics, proteomics etc…) many different features get measured a number if times (different samples like patients or evolution of a disease). The resulting data typically contain many (independent) rows (eg >1000 different genes or proteins who’s abundance was measured) and much fewer columns that may get further organized in groups of replicates. As R is a versatile language, multiple options exist for assessing the global characteristics of such data, some are more efficient on a computational point of view. In order to allow fast treatment of very large data-sets some tools have been re-designed for optimal performance.
Many measurement techniques applied in high throughput manner suffer from precision. This means, the same measurements taken twice in a row (ie repeated on the same subject) will very likely not give an identical result. For this reason it is common practice to make replicate measurements to i) estimate mean (ie representative) values and ii) asses the factors contributing to the variablity observed. Briefly, technical replicates represent the case where multiple read-outs of the very same sample are generated and the resulting variability is associated to technical issues during the process of taking measures. Biological replicates represent independant samples and reflect therefore the varibility a given parameter may have in a certain population of individuals. With the tools presented here, both technical and biological replicates can be dealt with. In several cases the interpretation of the resulting numbers should consider the experimental setup, though.
Let’s make a simple matrix as toy data:
grp1 <- rep(LETTERS[1:3], c(3,4,3))
sampNa1 <- paste0(grp1, c(1:3,1:4,1:3))
set.seed(2016); dat1 <- matrix(round(c(runif(50000) +rep(1:1000,50)),3),
ncol=10, dimnames=list(NULL,sampNa1))
dim(dat1)
## [1] 5000 10
head(dat1)
## A1 A2 A3 B1 B2 B3 B4 C1 C2 C3
## [1,] 1.180 1.640 1.199 1.118 1.425 1.745 1.253 1.554 1.303 1.856
## [2,] 2.143 2.237 2.730 2.693 2.603 2.293 2.542 2.452 2.148 2.776
## [3,] 3.842 3.155 3.191 3.520 3.686 3.408 3.409 3.871 3.345 3.588
## [4,] 4.134 4.394 4.982 4.320 4.380 4.888 4.965 4.462 4.250 4.647
## [5,] 5.478 5.472 5.488 5.570 5.626 5.765 5.551 5.016 5.659 5.139
## [6,] 6.121 6.294 6.718 6.890 6.999 6.316 6.542 6.119 6.763 6.487
Now lets estimate the standard deviation (sd) for every row:
head(rowSds(dat1))
## [1] 0.2583693 0.2426026 0.2477899 0.3089102 0.2307722 0.3124493
system.time(sd1 <- rowSds(dat1))
## user system elapsed
## 0 0 0
system.time(sd2 <- apply(dat1, 1, sd))
## user system elapsed
## 0.02 0.00 0.01
On most systems the equivalent calculation using apply()
will run much slower compared to rowSds
.
Note, there is a minor issue with rounding :
table(round(sd1, 13)==round(sd2, 13))
##
## FALSE TRUE
## 1 4999
Similarly we can easily calculate the CV (coefficient of variation,
ie sd / mean, see also CV)
for every row using rowCVs
:
system.time(cv1 <- rowCVs(dat1))
## user system elapsed
## 0 0 0
system.time(cv2 <- apply(dat1, 1, sd) / rowMeans(dat1))
## user system elapsed
## 0.03 0.00 0.04
# typically the calculation using rowCVs is much faster
head(cv1)
## [1] 0.18101959 0.09855083 0.07076678 0.06800894 0.04213940 0.04788568
# results from the 'conventional' way
head(cv2)
## [1] 0.18101959 0.09855083 0.07076678 0.06800894 0.04213940 0.04788568
Note, these calculations will be very efficient as long as the number of rows is much higher (>>) than the number of columns.
Now, let’s assume our data is contains 3 initial samples measured as
several replicates (already defined in grp1). Similarly, we can
also calculate the sd or CV for each line while splitting into groups of
replicates (functions rowGrpMeans
, rowGrpSds
and rowGrpCV
):
# we already defined the grouping :
grp1
## [1] "A" "A" "A" "B" "B" "B" "B" "C" "C" "C"
## the mean for each group and row
system.time(mean1Gr <- rowGrpMeans(dat1, grp1))
## user system elapsed
## 0 0 0
## Now the sd for each row and group
system.time(sd1Gr <- rowGrpSds(dat1, grp1))
## user system elapsed
## 0 0 0
# will give us a matrix with the sd for each group & line
head(sd1Gr)
## A B C
## [1,] 0.260269732 0.27074758 0.2768917
## [2,] 0.315291928 0.17144557 0.3140531
## [3,] 0.386666523 0.13115989 0.2632534
## [4,] 0.434443706 0.33521970 0.1986530
## [5,] 0.008082904 0.09672297 0.3413156
## [6,] 0.307168249 0.31475851 0.3230934
# Let's check the results of the first line :
sd1Gr[1,] == c(sd(dat1[1,1:3]), sd(dat1[1,4:7]), sd(dat1[1,8:10]))
## A B C
## TRUE TRUE TRUE
# The CV :
system.time(cv1Gr <- rowGrpCV(dat1, grp1))
## user system elapsed
## 0.02 0.00 0.01
head(cv1Gr)
## A B C
## [1,] 0.194279471 0.19545033 0.17625186
## [2,] 0.133034569 0.06769147 0.12773308
## [3,] 0.113859400 0.03741279 0.07309886
## [4,] 0.096471585 0.07227288 0.04461104
## [5,] 0.001475162 0.01718603 0.06474939
## [6,] 0.048163108 0.04707197 0.05004286
Some data, like with quantitative proteomics measures, may contain an elevated number of NAs (see also the package wrProteo for further options for dealing with such data). Furthermore, many other packages on CRAN and Bioconductor cover this topic, see also the missing data task-view on CRAN. Similar as above there is an easy way to count the number of NAs to get an overview how NAs are distributed.
Let’s assume we have measures from 3 groups/samples with 4 replicates each :
mat2 <- c(22.2, 22.5, 22.2, 22.2, 21.5, 22.0, 22.1, 21.7, 21.5, 22, 22.2, 22.7,
NA, NA, NA, NA, NA, NA, NA, 21.2, NA, NA, NA, NA,
NA, 22.6, 23.2, 23.2, 22.4, 22.8, 22.8, NA, 23.3, 23.2, NA, 23.7,
NA, 23.0, 23.1, 23.0, 23.2, 23.2, NA, 23.3, NA, NA, 23.3, 23.8)
mat2 <- matrix(mat2, ncol=12, byrow=TRUE)
## The definition of the groups (ie replicates)
gr4 <- gl(3, 4, labels=LETTERS[1:3])
Now we can easily count the number of NAs per row and set of replicates.
rowGrpNA(mat2,gr4)
## A B C
## [1,] 0 0 0
## [2,] 4 3 4
## [3,] 1 1 1
## [4,] 1 1 2
The function na.omit() from the package stats also
keeps a trace of all omitted instances. This can be penalizing in terms
of memory usage when handling very large vectors with a high content of
NAs (eg >10000 NAs). If you don’t need to document precisely which
elements got eliminated, the function naOmit()
may offer
smoother functioning for very large objects.
aA <- c(11:13,NA,10,NA)
str(naOmit(aA))
## num [1:4] 11 12 13 10
# the 'classical' na.omit also stores which elements were NA
str(na.omit(aA))
## num [1:4] 11 12 13 10
## - attr(*, "na.action")= 'omit' int [1:2] 4 6
If you need to find the closest neighbour(s) of a numeric vector, the
function minDiff()
will tell you the distance (“dif”,“ppm”
or “ratio”) and index (“best”) of the closest neighbour. In case of
multiple shortest distances the index if the first one is reported, and
the column “nbest” will display a value of >1.
set.seed(2017); aa <- 10 *c(0.1 +round(runif(20),2), 0.53, 0.53)
head(aa)
## [1] 10.2 6.4 5.7 3.9 8.7 8.7
minDiff(aa,ppm=FALSE)
## index value dif rat ncur nbest best
## [1,] 1 10.2 -0.2 0.981 1 1 19
## [2,] 2 6.4 0.4 1.070 1 1 15
## [3,] 3 5.7 0.3 0.950 2 1 15
## [4,] 4 3.9 0.2 1.050 1 1 10
## [5,] 5 8.7 0.5 1.060 2 1 18
## [6,] 6 8.7 0.5 1.060 2 1 18
## [7,] 7 1.4 0.1 1.080 1 1 13
## [8,] 8 5.3 0.3 1.060 4 1 17
## [9,] 9 5.7 0.3 0.950 2 1 15
## [10,] 10 3.7 -0.2 0.949 1 1 4
## [11,] 11 7.7 -0.5 0.939 1 1 18
## [12,] 12 1.0 -0.3 0.769 1 1 13
## [13,] 13 1.3 -0.1 0.929 1 1 7
## [14,] 14 5.3 0.3 1.060 4 1 17
## [15,] 15 6.0 0.3 1.050 1 2 9
## [16,] 16 4.9 -0.1 0.980 1 1 17
## [17,] 17 5.0 0.1 1.020 1 1 16
## [18,] 18 8.2 0.5 1.060 1 1 11
## [19,] 19 10.4 0.2 1.020 1 1 1
## [20,] 20 9.3 0.6 1.070 1 2 6
## [21,] 21 5.3 0.3 1.060 4 1 17
## [22,] 22 5.3 0.3 1.060 4 1 17
When you look at the first line, the value of 10.2 has one single closest value which is 10.4, which is located in line number 19 (the column ‘best’ gives the index of the best). Line number 19 points back to line number 1. You can see, that some elements (like 5.7) occure multiple times (line no 3 and 9), multiple occurences are counted in the column ncur. This is why column nbest for line 15 (value =6.0) indicates that it appears twice as closest value nbest.
When input from different places gets collected and combined into a
list, this may give a collection of different types of data. The
function partUnlist()
will to preserve multi-column
elements as they are (and just bring down one level):
bb <- list(fa=gl(2,2), ve=31:33, L2=matrix(21:28,ncol=2), li=list(li1=11:14,li2=data.frame(41:44)))
partUnlist(bb)
## $fa
## [1] 1 1 2 2
## Levels: 1 2
##
## $ve
## [1] 31 32 33
##
## $L2
## [,1] [,2]
## [1,] 21 25
## [2,] 22 26
## [3,] 23 27
## [4,] 24 28
##
## $li1
## [1] 11 12 13 14
##
## $li2
## X41.44
## 1 41
## 2 42
## 3 43
## 4 44
partUnlist(lapply(bb,.asDF2))
## partUnlist : Input is not list of lists, nothing to do
## $fa
## as.character(z)
## 1 1
## 2 1
## 3 2
## 4 2
##
## $ve
## z
## 1 31
## 2 32
## 3 33
##
## $L2
## V1 V2
## 1 21 25
## 2 22 26
## 3 23 27
## 4 24 28
##
## $li
## li1 X41.44
## 1 11 41
## 2 12 42
## 3 13 43
## 4 14 44
This won’t be possible using unlist().
head(unlist(bb, recursive=FALSE))
## $fa1
## [1] 1
##
## $fa2
## [1] 1
##
## $fa3
## [1] 2
##
## $fa4
## [1] 2
##
## $ve1
## [1] 31
##
## $ve2
## [1] 32
To uniform such data to obtain a list with one column only for each
list-element, the function asSepList()
provides help :
bb <- list(fa=gl(2,2), ve=31:33, L2=matrix(21:28,ncol=2), li=list(li1=11:14,li2=data.frame(41:44)))
asSepList(bb)
## $fa
## [1] 1 1 2 2
## Levels: 1 2
##
## $L2_1
## [1] 21 22 23 24
##
## $L2_2
## [1] 25 26 27 28
##
## $li1
## [1] 11 12 13 14
##
## $li2
## [1] 41 42 43 44
Separate lists may be combined using the append() command, which also allows treating simple vectors.
li1 <- list(a=1, b=2, c=3)
li2 <- list(A=11, b=2, C=13)
append(li1, li2)
## $a
## [1] 1
##
## $b
## [1] 2
##
## $c
## [1] 3
##
## $A
## [1] 11
##
## $b
## [1] 2
##
## $C
## [1] 13
However, this way there is no checking if some of the list-elements
are present in both lists and thus will appear twice. The function
appendNR()
allows to checking if some list-elements will
appear twice, and thus avoid such duplicate entries.
appendNR(li1, li2)
## appendNR : adding 2 new names/elements (1 already present)
## $a
## [1] 1
##
## $b
## [1] 2
##
## $c
## [1] 3
##
## $A
## [1] 11
##
## $C
## [1] 13
When a matrix (or data.frame) gets split into a list, like in the
example using by(), as a reverse-function such lists can get
joined using lrbind()
in an rbind-like
fashion.
dat2 <- matrix(11:34, ncol=3, dimnames=list(letters[1:8], colnames=LETTERS[1:3]))
lst2 <- by(dat2, rep(1:3,c(3,2,3)), as.matrix)
lst2
## INDICES: 1
## A B C
## a 11 19 27
## b 12 20 28
## c 13 21 29
## ------------------------------------------------------------
## INDICES: 2
## A B C
## d 14 22 30
## e 15 23 31
## ------------------------------------------------------------
## INDICES: 3
## A B C
## f 16 24 32
## g 17 25 33
## h 18 26 34
# join list-elements (back) into single matrix
lrbind(lst2)
## A B C
## a 11 19 27
## b 12 20 28
## c 13 21 29
## d 14 22 30
## e 15 23 31
## f 16 24 32
## g 17 25 33
## h 18 26 34
When combining different datasets the function
mergeMatrixList()
allows merging multiple matrices (or
data.frames) into a single matrix. Two types of mode of operation are
available : i) Returning only the common/shared elements (as defined by
the rownames), this is default mode=‘intersect’ ; alternatively
one may ii) fuse/merge all matrices together without any loss of data
(using mode=‘union’, additional _NA_s may appear when a given
rowname is absent in one of the input matrices).
Furthermore, one may specifically select which columns should be used for fusing using the argument useColumn.
mat1 <- matrix(11:18, ncol=2, dimnames=list(letters[3:6],LETTERS[1:2]))
mat2 <- matrix(21:28, ncol=2, dimnames=list(letters[2:5],LETTERS[3:4]))
mat3 <- matrix(31:38, ncol=2, dimnames=list(letters[c(1,3:4,3)],LETTERS[4:5]))
#
mergeMatrixList(list(mat1, mat2), useColumn="all")
## A B C D
## c 11 15 22 26
## d 12 16 23 27
## e 13 17 24 28
# with custom names for the individual matrices
mergeMatrixList(list(m1=mat1, m2=mat2, mat3), mode="union", useColumn=2)
## m1.B m2.D list_3.E
## a NA NA 35
## b NA 25 NA
## c 15 26 38
## d 16 27 37
## e 17 28 NA
## f 18 NA NA
Similarly, separate entries may be merged using
mergeMatrices()
:
mergeMatrices(mat1, mat2)
## A
## c 11 22
## d 12 23
## e 13 24
mergeMatrices(mat1, mat2, mat3, mode="union", useColumn=2)
## mat1.B mat2.D mat3.E
## a NA NA 35
## b NA 25 NA
## c 15 26 38
## d 16 27 37
## e 17 28 NA
## f 18 NA NA
## custom names for matrix-origin
mergeMatrices(m1=mat1, m2=mat2, mat3, mode="union", useColumn=2)
## m1.B m2.D mat3.E
## a NA NA 35
## b NA 25 NA
## c 15 26 38
## d 16 27 37
## e 17 28 NA
## f 18 NA NA
## flexible/custom selection of columns
mergeMatrices(m1=mat1, m2=mat2, mat3, mode="union", useColumn=list(1,1:2,2))
## m1.A m2.C m2.D mat3.E
## a NA NA NA 35
## b NA 21 25 NA
## c 11 22 26 38
## d 12 23 27 37
## e 13 24 28 NA
## f 14 NA NA NA
When list-elements have the same name, their content (of named
numeric or character vectors) may get fused using
fuseCommonListElem()
according to the names of the
list-elements :
val1 <- 10 +1:26
names(val1) <- letters
(lst1 <- list(c=val1[3:6], a=val1[1:3], b=val1[2:3] ,a=val1[12], c=val1[13]))
## $c
## c d e f
## 13 14 15 16
##
## $a
## a b c
## 11 12 13
##
## $b
## b c
## 12 13
##
## $a
## l
## 22
##
## $c
## m
## 23
## here the names 'a' and 'c' appear twice :
names(lst1)
## [1] "c" "a" "b" "a" "c"
## now, let's fuse all 'a' and 'c'
fuseCommonListElem(lst1)
## $c
## c d e f m
## 13 14 15 16 23
##
## $a
## a b c l
## 11 12 13 22
##
## $b
## b c
## 12 13
In a number of cases the information in various list-elements is somehow related. Eg, in S3-objects produced by limma, or data produced using wrProteo several instances of matrix or data.frame refer to data that are related. Some matrixes may conatain abundance data (or weights, etc) while another matrix or data.frame may contain the annotation information related to each line of the abundance data. So if one wants to filter the data, ie remove some lines, this should be done in the same way with all related list-elements. This way one may maintain a conventient 1:1 matching of lines.
The function filterLiColDeList()
searches if other
list-elements have suitable dimensions and will then run the same
filtering as in the ‘target’ list-element. In consequence this can be
used with the output of wrProteo to remove simultaneously the same lines
and/or columns.
lst1 <- list(m1=matrix(11:18, ncol=2), m2=matrix(21:30, ncol=2), indR=31:34,
m3=matrix(c(21:23,NA,25:27,NA), ncol=2))
filterLiColDeList(lst1, useLines=2:3)
## filterLiColDeList : successfully filtered 'm1' and 'm3' from 4 to 2 lines
## $m1
## [,1] [,2]
## [1,] 12 16
## [2,] 13 17
##
## $m2
## [,1] [,2]
## [1,] 21 26
## [2,] 22 27
## [3,] 23 28
## [4,] 24 29
## [5,] 25 30
##
## $indR
## [1] 31 32 33 34
##
## $m3
## [,1] [,2]
## [1,] 22 26
## [2,] 23 27
filterLiColDeList(lst1, useLines="allNA", ref=3)
## filterLiColDeList : It appears lst[[ref]] is not matrix (or data.frame) ! Trying to reformat ..
## filterLiColDeList : 'useLines' seems empty, nothing to do ...
## $m1
## [,1] [,2]
## [1,] 11 15
## [2,] 12 16
## [3,] 13 17
## [4,] 14 18
##
## $m2
## [,1] [,2]
## [1,] 21 26
## [2,] 22 27
## [3,] 23 28
## [4,] 24 29
## [5,] 25 30
##
## $indR
## [,1]
## [1,] 31
## [2,] 32
## [3,] 33
## [4,] 34
##
## $m3
## [,1] [,2]
## [1,] 21 25
## [2,] 22 26
## [3,] 23 27
## [4,] NA NA
The function listBatchReplace()
works similar to
sub() and allows to search & replace exact matches to a
character string along all elements of a list.
(lst1 <- list(aa=1:4, bb=c("abc","efg","abhh","effge"), cc=c("abdc","efg","efgh")))
## $aa
## [1] 1 2 3 4
##
## $bb
## [1] "abc" "efg" "abhh" "effge"
##
## $cc
## [1] "abdc" "efg" "efgh"
listBatchReplace(lst1, search="efg", repl="EFG", silent=FALSE)
## $aa
## [1] "1" "2" "3" "4"
##
## $bb
## [1] "abc" "EFG" "abhh" "effge"
##
## $cc
## [1] "abdc" "EFG" "efgh"
Named numeric or character vectors can be organized into lists using
listGroupsByNames()
, based on their names (only the part
before any extensions starting with a point gets considered). Of course,
other separators may be defined using the argument sep.
ser1 <- 1:7; names(ser1) <- c("AA","BB","AA.1","CC","AA.b","BB.e","A")
listGroupsByNames(ser1)
## $AA
## AA AA.1 AA.b
## 1 3 5
##
## $BB
## BB BB.e
## 2 6
##
## $CC
## CC
## 4
##
## $A
## A
## 7
If no names are present, the content of the vector itself will be used as name :
listGroupsByNames((1:10)/5)
## listGroupsByNames : no names found in 'x' !!
## $`0`
## 0 0 0 0
## 0.2 0.4 0.6 0.8
##
## $`1`
## 1 1 1 1 1
## 1.0 1.2 1.4 1.6 1.8
##
## $`2`
## 2
## 2
In the view of object-oriented programming several methods produce
results integrated into lists or S3-objects (eg limma).
The function filterList()
aims facilitating the filtering
of all elements of lists or S3-objects. List-elements with inappropriate
number of lines will be ignored.
set.seed(2020); dat1 <- round(runif(80),2)
list1 <- list(m1=matrix(dat1[1:40], ncol=8), m2=matrix(dat1[41:80], ncol=8), other=letters[1:8])
rownames(list1$m1) <- rownames(list1$m2) <- paste0("line",1:5)
# Note: the list-element list1$other has a length different to that of filt. Thus, it won't get filtered.
filterList(list1, list1$m1[,1] >0.4) # filter according to 1st column of $m1 ...
## $m1
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## line1 0.65 0.07 0.76 0.54 0.20 0.17 0.96 0.37
## line3 0.62 0.39 0.83 0.65 0.82 0.75 0.96 0.93
## line4 0.48 0.00 0.42 0.55 0.94 0.45 0.95 0.52
##
## $m2
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## line1 0.99 0.57 0.58 0.00 0.21 0.61 0.61 0.30
## line3 0.86 0.70 0.90 0.22 0.23 0.58 0.39 0.06
## line4 0.88 0.80 0.52 0.54 0.42 0.65 0.47 0.67
##
## $other
## [1] "a" "b" "c" "d" "e" "f" "g" "h"
filterList(list1, list1$m1 >0.4)
## $m1
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## line1 0.65 0.07 0.76 0.54 0.20 0.17 0.96 0.37
## line3 0.62 0.39 0.83 0.65 0.82 0.75 0.96 0.93
## line4 0.48 0.00 0.42 0.55 0.94 0.45 0.95 0.52
## line5 0.14 0.62 0.41 0.27 0.88 0.56 0.00 0.22
##
## $m2
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## line1 0.99 0.57 0.58 0.00 0.21 0.61 0.61 0.30
## line3 0.86 0.70 0.90 0.22 0.23 0.58 0.39 0.06
## line4 0.88 0.80 0.52 0.54 0.42 0.65 0.47 0.67
## line5 0.62 0.17 0.83 0.49 0.86 0.17 0.53 0.72
##
## $other
## [1] "a" "b" "c" "d" "e" "f" "g" "h"
At some occasions it may be useful separate columns of a matrix into
separate vectors inside a list. This can be done using
matr2list()
:
(mat1 <- matrix(1:12, ncol=3, dimnames=list(letters[1:4],LETTERS[1:3])))
## A B C
## a 1 5 9
## b 2 6 10
## c 3 7 11
## d 4 8 12
str(matr2list(mat1))
## List of 3
## $ A: Named num [1:4] 1 2 3 4
## ..- attr(*, "names")= chr [1:4] "A.a" "A.b" "A.c" "A.d"
## $ B: Named num [1:4] 5 6 7 8
## ..- attr(*, "names")= chr [1:4] "B.a" "B.b" "B.c" "B.d"
## $ C: Named num [1:4] 9 10 11 12
## ..- attr(*, "names")= chr [1:4] "C.a" "C.b" "C.c" "C.d"
Let’s get stared with a little toy-array:
(arr1 <- array(c(6:4,4:24), dim=c(4,3,2), dimnames=list(c(LETTERS[1:4]),
paste("col",1:3,sep=""),c("ch1","ch2"))))
## , , ch1
##
## col1 col2 col3
## A 6 5 9
## B 5 6 10
## C 4 7 11
## D 4 8 12
##
## , , ch2
##
## col1 col2 col3
## A 13 17 21
## B 14 18 22
## C 15 19 23
## D 16 20 24
Now we can obtain the CV (coefficient of variance) by splitting along
3rd dimesion (ie this is equivalent to an apply along the 3rd
dimension) using arrayCV()
:
arrayCV(arr1)
## ch1 ch2
## A 0.3122499 0.2352941
## B 0.3779645 0.2222222
## C 0.4788934 0.2105263
## D 0.5000000 0.2000000
# this is equivalent to
cbind(rowCVs(arr1[,,1]), rowCVs(arr1[,,2]))
## [,1] [,2]
## A 0.3122499 0.2352941
## B 0.3779645 0.2222222
## C 0.4788934 0.2105263
## D 0.5000000 0.2000000
Similarly we can split along any other dimension, eg the 2nd dimension :
arrayCV(arr1, byDim=2)
## col1 col2 col3
## A 0.5210260 0.7713892 0.5656854
## B 0.6698906 0.7071068 0.5303301
## C 0.8187552 0.6527140 0.4991342
## D 0.8485281 0.6060915 0.4714045
This procedure is similar to (re-)organizing an initial array into
clusters, here we split along a user-defined factor/vector. If a
clustering-algorithm produces the cluster assignments, this function can
be used to organize the input data accordingly using
cutArrayInCluLike()
.
cutArrayInCluLike(arr1, cluOrg=c(2,1,2,1))
## $`2`
## , , ch1
##
## col1 col2 col3
## A 6 5 9
## C 4 7 11
##
## , , ch2
##
## col1 col2 col3
## A 13 17 21
## C 15 19 23
##
##
## $`1`
## , , ch1
##
## col1 col2 col3
## B 5 6 10
## D 4 8 12
##
## , , ch2
##
## col1 col2 col3
## B 14 18 22
## D 16 20 24
Let’s cut by filtering along the 3rd dimension for all lines where
column ‘col2’ is >7, and then display only the content of columns
‘col1’ and ‘col2’ (using filt3dimArr()
):
filt3dimArr(arr1, displCrit=c("col1","col2"), filtCrit="col2", filtVal=7, filtTy=">")
## [[1]]
## col1 col2
## 4 8
##
## [[2]]
## col1 col2
## A 13 17
## B 14 18
## C 15 19
## D 16 20
\(_Semantics_\) : Please note, that there are two ways of interpreting the term ‘unique’ :
In regular understanding one describes this way an event which occurs only once, and thus does not occur/happen anywhere else.
The command unique()
will eliminate redundant
entries to obtain a shorter ‘unique’ output vector, ie in the resultant
vector all values/content (values) occur only once. However, from the
result of unique() you cannot tell any more which ones were not
unique initially !
In some applications (eg proteomics) initial identifiers (IDs) may
occur multiple times in the data and we frequently need to identify
events/values that occur only once, as the first meaning of
‘unique’. This package provides (additional) functions to
easily distinguish values occurring just once (ie unique) from
those occurring multiple times. Furthermore, there are functions to
rename/remove/combine replicated elements, eg
correctToUnique()
or nonAmbiguousNum()
, so
that no elements or lines of data get lost.
## some text toy data
tr <- c("li0","n",NA,NA, rep(c("li2","li3"),2), rep("n",4))
The function table() (from the package base) is
very useful get some insights when working with smaller objects, but may
be slow to handle very large objects. As mentioned, unique()
will make everything unique, and afterwards you won’t know any more who
was unique in the first place ! The function duplicated()
(also from package base) helps us getting the information who is
repeated.
table(tr)
## tr
## li0 li2 li3 n
## 1 2 2 5
unique(tr)
## [1] "li0" "n" NA "li2" "li3"
duplicated(tr, fromLast=FALSE)
## [1] FALSE FALSE FALSE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
aa <- c(11:16,NA,14:12,NA,14)
names(aa) <- letters[1:length(aa)]
aa
## a b c d e f g h i j k l
## 11 12 13 14 15 16 NA 14 13 12 NA 14
findRepeated()
(from this package) will return the
position/index (and content/value) of repeated elements. However, the
output in form of a list is not very convenient to the human reader.
findRepeated(aa)
## $`12`
## [1] 2 10
##
## $`13`
## [1] 3 9
##
## $`14`
## [1] 4 8 12
firstOfRepeated()
tells the index of the first instance
of repeated elements, which elements you need to make the vector
‘unique’, and which elements get stripped off when making unique. Please
note, that NA (no matter if they occure once or more times) are
automatically in the part suggested to be removed.
firstOfRepeated(aa)
## $indRepeated
## 12 13 14
## 2 3 4
##
## $indUniq
## a b c d e f
## 1 2 3 4 5 6
##
## $indRedund
## g h i j k l
## 7 8 9 10 11 12
aa[firstOfRepeated(aa)$indUniq] # only unique with their names
## a b c d e f
## 11 12 13 14 15 16
unique(aa) # unique() does not return any names !
## [1] 11 12 13 14 15 16 NA
If necessary, a counter can be added to non-unique entries, thus no
individual values get eliminated and the length and order of the
resultant object maintains the same using
correctToUnique()
.
This is of importance when assigning rownames to a data.frame : Assigning redundant values/text as rownames of a data.frame will result in an error !
correctToUnique(aa)
## a b c d e f g h i j k
## "11" "12_1" "13_1" "14_1" "15" "16" "NA_1" "14_2" "13_2" "12_2" "NA_2"
## l
## "14_3"
correctToUnique(aa, sep=".", NAenum=FALSE) # keep NAs (ie without transforming to character)
## a b c d e f g h i j k
## "11" "12.1" "13.1" "14.1" "15" "16" NA "14.2" "13.2" "12.2" NA
## l
## "14.3"
You see from the last example above, that this function has an argument for controlling enumerating elements.
First, the truly unique values are reported and then the first
occurance of repeated elements is given, NA instances get
ignored. This can be done using nonAmbiguousNum()
which
maintains the length of the initial character vector.
unique(aa) # names are lost
## [1] 11 12 13 14 15 16 NA
nonAmbiguousNum(aa)
## a e f amb_b amb_c
## 11 15 16 12 13
nonAmbiguousNum(aa, uniq=FALSE, asLi=TRUE) # separate in list unique and repeated
## $unique
## a e f
## 11 15 16
##
## $ambig
## amb_b amb_c
## 12 13
The main aim of the function sortByNRepeated()
is
allowing to compare multiple vectors for common values/words and
providing an output sorted by number of repeats.
Suppose 3 persons are asked which cities they wanted to visit. Then we would like to make a counting of the most frequently cited cities. Here we consider individual choices as equally ranked. By default intra-repeats are eliminated.
cities <- c("Bangkok","London","Paris", "Singapore","New York City", "Istambul","Delhi","Rome","Dubai")
sortByNRepeated(x=cities[c(1:4)], y=cities[c(2:3,5:8)])
## $`1`
## [1] "Bangkok" "Delhi" "Istambul" "New York City"
## [5] "Rome" "Singapore"
##
## $`2`
## [1] "London" "Paris"
## or (unlimited) multiple inputs via list
choices1 <- list(Mary=cities[c(1:4)], Olivia=cities[c(2:3,5:8)], Paul=cities[c(5:3,9,5)]) # Note : Paul cited NYC twice !
table(unlist(choices1))
##
## Bangkok Delhi Dubai Istambul London
## 1 1 1 1 2
## New York City Paris Rome Singapore
## 3 3 1 2
sortByNRepeated(choices1)
## $`1`
## [1] "Bangkok" "Delhi" "Dubai" "Istambul" "Rome"
##
## $`2`
## [1] "London" "New York City" "Singapore"
##
## $`3`
## [1] "Paris"
sortByNRepeated(choices1, filterIntraRep=FALSE) # without correcting multiple citation of NYC by Paul
## $`1`
## [1] "Bangkok" "Delhi" "Dubai" "Istambul" "Rome"
##
## $`2`
## [1] "London" "Singapore"
##
## $`3`
## [1] "New York City" "Paris"
Here, it is supposed that you want to join 2 or more matrixes
describing different properties of the same collection of individuals
(as rows). Common column-names are interpreted that their respective
information should be combined (either as average or as sum). This can
be done using cbindNR()
:
## First we'll make some toy data :
(ma1 <- matrix(1:6, ncol=3, dimnames=list(1:2,LETTERS[3:1])))
## C B A
## 1 1 3 5
## 2 2 4 6
(ma2 <- matrix(11:16, ncol=3, dimnames=list(1:2,LETTERS[3:5])))
## C D E
## 1 11 13 15
## 2 12 14 16
## now we can join 2 or more matrixes
cbindNR(ma1, ma2, summarizeAs="mean") # average of both columns 'C'
## cbindNR : treating 5 different (types of) columns : C B A D E
## cbindNR : sorting columns of output
## A B C D E
## 1 5 3 6 13 15
## 2 6 4 7 14 16
This ressembles to the functioning of unique(), but applies to a user-specified column of the matrix.
(mat1 <- matrix(c(1:6, rep(1:3,1:3)), ncol=2, dimnames=list(letters[1:6],LETTERS[1:2])))
## A B
## a 1 1
## b 2 2
## c 3 2
## d 4 3
## e 5 3
## f 6 3
The function firstLineOfDat()
allows to access/extract
the first line of repeated instances.
firstLineOfDat(mat1, refCol=2)
## A B
## a 1 1
## b 2 2
## d 4 3
This function was rather designed for dealing with character input, it allows concatenating all columns and to remove redundant.
mat2 <- matrix(c("e","n","a","n","z","z","n","z","z","b",
"","n","c","n","","","n","","","z"), ncol=2)
firstOfRepLines(mat2, out="conc")
## [1] "e" "nn" "ac" "z" "bz"
# or as index :
firstOfRepLines(mat2)
## [1] 1 2 3 5 10
(df1 <- data.frame(cbind(xA=letters[1:5], xB=c("h","h","f","e","f"), xC=LETTERS[1:5])))
## xA xB xC
## 1 a h A
## 2 b h B
## 3 c f C
## 4 d e D
## 5 e f E
The function nonredDataFrame()
offers to include a
counter of redundant instances encountered (for 1st column specified)
:
nonredDataFrame(df1, useCol=c("xB","xC"))
## xA xB xC nSamePep concID
## 1 a h A 2 C//E
## 3 c f C 2 A//B
## 4 d e D 1 D
# without counter or concatenating
df1[which(!duplicated(df1[,2])),]
## xA xB xC
## 1 a h A
## 3 c f C
## 4 d e D
# or
df1[firstOfRepLines(df1,useCol=2),]
## xA xB xC
## 1 a h A
## 3 c f C
## 4 d e D
mat2 <- cbind(no=as.character(1:20), seq=sample(LETTERS[1:15], 20, repl=TRUE),
ty=sample(c("full","Nter","inter"),20,repl=TRUE), ambig=rep(NA,20), seqNa=1:20)
(mat2uniq <- get1stOfRepeatedByCol(mat2, sortBy="seq", sortSupl="ty"))
## no seq ty ambig seqNa
## [1,] "6" "M" "Nter" NA "6"
## [2,] "11" "C" "inter" NA "11"
## [3,] "12" "N" "Nter" NA "12"
## [4,] "17" "J" "full" NA "17"
## [5,] "18" "A" "full" NA "18"
## [6,] "19" "O" "Nter" NA "19"
## [7,] "7" "B" "Nter" "TRUE" "_7"
## [8,] "10" "D" "full" "TRUE" "_10"
## [9,] "8" "E" "full" "TRUE" "_8"
## [10,] "9" "F" "full" "TRUE" "_9"
## [11,] "13" "G" "Nter" "TRUE" "_13"
## [12,] "3" "H" "Nter" "TRUE" "_3"
# the values from column 'seq' are indeed unique
table(mat2uniq[,"seq"])
##
## A B C D E F G H J M N O
## 1 1 1 1 1 1 1 1 1 1 1 1
# This will return all first repeated (may be >1) but without furter sorting
# along column 'ty' neither marking in comumn 'ambig').
mat2[which(duplicated(mat2[,2],fromLast=FALSE)),]
## no seq ty ambig seqNa
## [1,] "5" "H" "Nter" NA "5"
## [2,] "8" "E" "full" NA "8"
## [3,] "9" "F" "full" NA "9"
## [4,] "13" "G" "Nter" NA "13"
## [5,] "14" "D" "full" NA "14"
## [6,] "15" "D" "full" NA "15"
## [7,] "16" "B" "Nter" NA "16"
## [8,] "20" "F" "full" NA "20"
nonAmbiguousMat(mat1,by=2)
## A B
## 1 1 1
## amb_3 3 2
## amb_6 6 3
Here another example, ambiguous will be marked by an ’_’ :
set.seed(2017); mat3 <- matrix(c(1:100,round(rnorm(200),2)), ncol=3,
dimnames=list(1:100,LETTERS[1:3]));
head(mat3U <- nonAmbiguousMat(mat3, by="B", na="_", uniqO=FALSE), n=15)
## A B C
## 81 81 -2.59 -0.14
## 93 93 -2.02 -0.03
## 7 7 -1.96 0.52
## 4 4 -1.76 0.84
## _74 74 -1.65 0.30
## 55 55 -1.59 1.25
## 52 52 -1.58 -0.24
## 15 15 -1.43 -0.60
## 98 98 -1.34 0.41
## 63 63 -1.33 0.26
## 19 19 -1.13 0.70
## 41 41 -1.06 -0.56
## _56 56 -1.03 -1.07
## 94 94 -0.98 -0.02
## 95 95 -0.97 0.08
head(get1stOfRepeatedByCol(mat3, sortB="B", sortS="B"))
## A B C
## 1 1 1.43 0.02
## 2 2 -0.08 1.38
## 3 3 0.74 -0.07
## 4 4 -1.76 0.84
## 5 5 -0.07 -0.97
## 6 6 0.45 -1.97
lst2 <- list(aa_1x=matrix(1:12, nrow=4, byrow=TRUE), ab_2x=matrix(24:13, nrow=4, byrow=TRUE))
combineReplFromListToMatr(lst2)
## $a
## 1
## [1,] 1
## [2,] 4
## [3,] 7
## [4,] 10
## [5,] 2
## [6,] 5
## [7,] 8
## [8,] 11
## [9,] 3
## [10,] 6
## [11,] 9
## [12,] 12
##
## $b
## 2
## [1,] 24
## [2,] 21
## [3,] 18
## [4,] 15
## [5,] 23
## [6,] 20
## [7,] 17
## [8,] 14
## [9,] 22
## [10,] 19
## [11,] 16
## [12,] 13
The function combineRedundLinesInList()
provides help
for combining/summarizing lines of numeric data which may be summaried
according to reference vector or matrix (part of the same input-list).
Initial data and reference will be aligned based on rownames and the
content of reference (or the column specified by ).
x1 <- list(quant=matrix(11:34, ncol=3, dimnames=list(letters[8:1], LETTERS[11:13])),
annot=matrix(paste0(LETTERS[c(1:4,6,3:5)],LETTERS[c(1:4,6,3:5)]), ncol=1,
dimnames=list(paste(letters[1:8]),"xx")) )
combineRedundLinesInList(lst=x1, refNa="annot", datNa="quant", refColNa="xx")
## $quant
## K L M
## AA 11.0 19.0 27.0
## BB 12.0 20.0 28.0
## CC 14.5 22.5 30.5
## DD 15.5 23.5 31.5
## EE 18.0 26.0 34.0
## FF 15.0 23.0 31.0
##
## $annot
## xx xx2
## [1,] "AA" "a"
## [2,] "BB" "b"
## [3,] "CC" "c,f"
## [4,] "DD" "d,g"
## [5,] "EE" "h"
## [6,] "FF" "e"
mat4 <- matrix(rep(c(1,1:3,3,1),2), ncol=2, dimnames=list(letters[1:6],LETTERS[1:2]))
nonRedundLines(mat4)
## A B
## a 1 1
## c 2 2
## d 3 3
## f 1 1
# input: c and dd are repeated :
filtSizeUniq(list(A="a", B=c("b","bb","c"), D=c("dd","d","ddd","c")), filtUn=TRUE, minSi=NULL)
## filtSizeUniq : 2 out of 8 peptides redundant
## $A
## A
## "a"
##
## $B
## B.1 B.2
## "b" "bb"
##
## $D
## D.1 D.2 D.3
## "dd" "d" "ddd"
# here a,b,c and dd are repeated :
filtSizeUniq(list(A="a", B=c("b","bb","c"), D=c("dd","d","ddd","c")), ref=c(letters[c(1:26,1:3)],
"dd","dd","bb","ddd"), filtUn=TRUE, minSi=NULL)
## filtSizeUniq : 8 out of 8 peptides redundant
## $A
## character(0)
##
## $B
## character(0)
##
## $D
## character(0)
t3 <- data.frame(ref=rep(11:15,3), tx=letters[1:15],
matrix(round(runif(30,-3,2),1), nc=2), stringsAsFactors=FALSE)
# First we split the data.frame in list
by(t3,t3[,1], function(x) x)
## t3[, 1]: 11
## ref tx X1 X2
## 1 11 a 0.4 -1.1
## 6 11 f 0.6 1.0
## 11 11 k 0.1 1.2
## ------------------------------------------------------------
## t3[, 1]: 12
## ref tx X1 X2
## 2 12 b 2.0 -0.4
## 7 12 g -0.3 1.8
## 12 12 l -1.4 0.3
## ------------------------------------------------------------
## t3[, 1]: 13
## ref tx X1 X2
## 3 13 c 0.8 -1.6
## 8 13 h 0.8 -2.4
## 13 13 m 0.9 1.8
## ------------------------------------------------------------
## t3[, 1]: 14
## ref tx X1 X2
## 4 14 d 1.7 0.7
## 9 14 i -1.6 -2.6
## 14 14 n 1.4 -1.1
## ------------------------------------------------------------
## t3[, 1]: 15
## ref tx X1 X2
## 5 15 e -0.9 0.4
## 10 15 j -1.7 0.0
## 15 15 o -1.2 -1.8
t(sapply(by(t3,t3[,1],function(x) x), summarizeCols, me="maxAbsOfRef"))
## ref tx X1 X2
## 11 11 "k" 0.1 1.2
## 12 12 "g" -0.3 1.8
## 13 13 "h" 0.8 -2.4
## 14 14 "i" -1.6 -2.6
## 15 15 "o" -1.2 -1.8
(xt3 <- makeNRedMatr(t3, summ="mean", iniID="ref"))
## makeNRedMatr : Common summarization method 'mean', run as batch
## makeNRedMatr : Summarize redundant based on col 'ref' using method(s) : 'mean', 'mean', 'mean' and 'mean' yielding 4 cols
## ID ref tx X1 X2 nRedLi
## 11 11 11 f 0.3666667 0.3666667 3
## 12 12 12 g 0.1000000 0.5666667 3
## 13 13 13 h 0.8333333 -0.7333333 3
## 14 14 14 i 0.5000000 -1.0000000 3
## 15 15 15 j -1.2666667 -0.4666667 3
(xt3 <- makeNRedMatr(t3, summ=unlist(list(X1="maxAbsOfRef")), iniID="ref"))
## makeNRedMatr : Summarize redundant based on col 'ref' using method(s) : 'maxAbsOfRef' and col 'X1' yielding 4 cols
## ref tx X1 X2 nRedLi
## 11 11 f 0.6 1.0 3
## 12 12 b 2.0 -0.4 3
## 13 13 m 0.9 1.8 3
## 14 14 d 1.7 0.7 3
## 15 15 j -1.7 0.0 3
In the previous example for each subgroup a summarization was calculated. In other cases you may just want to select a line according to values from a single column. Suppose you want to select the line corresponding to the longest transcript.
set.seed(2024)
df2 <- data.frame(transcrID=paste("TrID", 101:124, sep=""), geneID=paste("geID",rep(201:206,each=4)),
geneLe=round(runif(24, min=50, max=1500)), a=101:124, b=224:201)
df2 <- df2[-1*c(1,4:7,12:15),]
(dfLongest <- makeNRedMatr(df2, summ=unlist(list(X1="maxOfRef")), iniID="geneID"))
## makeNRedMatr : Which column to use with 'maxOfRef' not specified, using last numeric 'b'
## makeNRedMatr : Summarize redundant based on col 'geneID' using method(s) : 'maxOfRef' and col 'b' yielding 5 cols
## transcrID geneID geneLe a b nRedLi
## geID 201 TrID102 geID 201 515 102 223 2
## geID 202 TrID108 geID 202 490 108 217 1
## geID 203 TrID109 geID 203 1321 109 216 3
## geID 204 TrID116 geID 204 1271 116 209 1
## geID 205 TrID117 geID 205 210 117 208 4
## geID 206 TrID121 geID 206 119 121 204 4
summarizeCols(df2[1:2,c(1:5,3)], me="min")
## transcrID geneID geneLe a b geneLe.1
## out "TrID102" "geID 201" " 515" "102" "222" " 515"
summarizeCols(df2[1:2,c(1:5,3)], me="mean")
## transcrID geneID geneLe a b geneLe.1
## 2 TrID102 geID 201 776 102.5 222.5 776
summarizeCols(df2[1:2,c(1:5,3)], me="maxOfRef")
## transcrID geneID geneLe a b geneLe.1
## 3 TrID103 geID 201 1037 103 222 1037
summarizeCols(df2[1:6,c(1:5,3)], me="maxOfRef")
## transcrID geneID geneLe a b geneLe.1
## 11 TrID111 geID 203 1358 111 214 1358
(xt3 <- makeNRedMatr(t3, summ=unlist(list(X1="maxOfRef")), iniID=c("ref")))
## makeNRedMatr : Summarize redundant based on col 'ref' using method(s) : 'maxOfRef' and col 'X1' yielding 4 cols
## ref tx X1 X2 nRedLi
## 11 11 f 0.6 1.0 3
## 12 12 b 2.0 -0.4 3
## 13 13 m 0.9 1.8 3
## 14 14 d 1.7 0.7 3
## 15 15 e -0.9 0.4 3
matr <- matrix(c(letters[1:6],"h","h","f","e",LETTERS[1:5]), ncol=3,
dimnames=list(letters[11:15],c("xA","xB","xC")))
combineRedBasedOnCol(matr, colN="xB")
## xA xB xC
## 2 "a,d" "f" "A,D"
## 3 "b,c" "h" "B,C"
## 1 "e" "e" "E"
combineRedBasedOnCol(rbind(matr[1,],matr), colN="xB")
## xA xB xC
## 2 "a,d" "f" "A,D"
## 3 "b,c" "h" "B,C"
## 1 "e" "e" "E"
x <- 1
dat1 <- matrix(1:10, ncol=2)
rownames(dat1) <- letters[c(1:3,2,5)]
## as.data.frame(dat1) ... would result in an error
convMatr2df(dat1)
## ID X1 X2
## a a 1 6
## b_1 b 2 7
## c c 3 8
## b_2 b 4 9
## e e 5 10
convMatr2df(data.frame(a=as.character((1:3)/2), b=LETTERS[1:3], c=1:3))
## a b c
## 1 0.5 A 1
## 2 1.0 B 2
## 3 1.5 C 3
tmp <- data.frame(a=as.character((1:3)/2), b=LETTERS[1:3], c=1:3, stringsAsFactors=FALSE)
convMatr2df(tmp)
## a b c
## 1 0.5 A 1
## 2 1.0 B 2
## 3 1.5 C 3
tmp <- data.frame(a=as.character((1:3)/2), b=1:3, stringsAsFactors=FALSE)
convMatr2df(tmp)
## a b
## 1 0.5 1
## 2 1.0 2
## 3 1.5 3
set.seed(2013)
datT2 <- matrix(round(rnorm(200)+3,1), ncol=2, dimnames=list(paste("li",1:100,sep=""),
letters[23:24]))
# (mimick) some short and longer names for each line
inf2 <- cbind(sh=paste(rep(letters[1:4],each=26), rep(letters,4),1:(26*4),sep=""),
lo=paste(rep(LETTERS[1:4],each=26), rep(LETTERS,4), 1:(26*4), ",",
rep(letters[sample.int(26)],4), rep(letters[sample.int(26)],4), sep=""))[1:100,]
## We'll use this to test :
head(datT2, n=10)
## w x
## li1 2.9 3.7
## li2 3.8 3.3
## li3 2.3 3.3
## li4 4.4 3.1
## li5 4.5 1.8
## li6 0.4 2.4
## li7 3.7 3.3
## li8 3.3 4.0
## li9 5.0 4.1
## li10 1.6 1.1
## let's assign to each pair of x & y values a 'cluster' (column _clu_, the column _combInf_ tells us which lines/indexes are in this cluster)
head(combineOverlapInfo(datT2, disThr=0.03), n=10)
## w x combInf clu isComb
## li1 2.9 3.7 1+16+22+47+91 1 TRUE
## li2 3.8 3.3 2+7+48+54 2 TRUE
## li3 2.3 3.3 3+66+92 3 TRUE
## li4 4.4 3.1 4 52 FALSE
## li5 4.5 1.8 5 53 FALSE
## li6 0.4 2.4 6 54 FALSE
## li7 3.7 3.3 2+7+48+54 2 TRUE
## li8 3.3 4.0 8+100 4 TRUE
## li9 5.0 4.1 9 55 FALSE
## li10 1.6 1.1 10 56 FALSE
## it is also possible to rather display names (eg gene or protein-names) instead of index values
head(combineOverlapInfo(datT2, suplI=inf2[,2], disThr=0.03), n=10)
## w x combInf clu isComb
## li1 2.9 3.7 AA1+AP16+AV22+BU47+DM91 1 TRUE
## li2 3.8 3.3 AB2+AG7+BV48+CB54 2 TRUE
## li3 2.3 3.3 AC3+CN66+DN92 3 TRUE
## li4 4.4 3.1 AD4,ww 52 FALSE
## li5 4.5 1.8 AE5,aj 53 FALSE
## li6 0.4 2.4 AF6,nl 54 FALSE
## li7 3.7 3.3 AB2+AG7+BV48+CB54 2 TRUE
## li8 3.3 4.0 AH8+DV100 4 TRUE
## li9 5.0 4.1 AI9,ic 55 FALSE
## li10 1.6 1.1 AJ10,ee 56 FALSE
dat <- 11:19
names(dat) <- letters[c(6:3,2:4,8,3)]
## Here the names are not unique.
## Thus, the values can be binned by their (non-unique) names and a representative values calculated.
## Let's make a 'datUniq' with the mean of each group of values :
datUniq <- round(tapply(dat, names(dat), mean),1)
## now we propagate the mean values to the full vector
getValuesByUnique(dat, datUniq)
## f e d c b c d h c
## 11.0 12.0 15.0 16.3 15.0 16.3 15.0 18.0 16.3
cbind(ini=dat,firstOfRep=getValuesByUnique(dat, datUniq),
indexUniq=getValuesByUnique(dat, datUniq, asIn=TRUE))
## ini firstOfRep indexUniq
## f 11 11.0 5
## e 12 12.0 4
## d 13 15.0 3
## c 14 16.3 2
## b 15 15.0 1
## c 16 16.3 2
## d 17 15.0 3
## h 18 18.0 6
## c 19 16.3 2
For example, if you wish to create group-labels considering the eye-
and hair-color of a small group students (supposed a sort of controlled
vocabulary was used), the function combineByEitherFactor()
will help. So basically, this is an empiric segmentation-approach for
two categorical variables. Please note, that with large data-sets and
very disperse data this approach will not provide great results. In the
example below we’ll attempt to ‘cluster’ according to columns
nn and qq, the resultant cluster number can be found
in column grp.
nn <- rep(c("a","e","b","c","d","g","f"),c(3,1,2,2,1,2,1))
qq <- rep(c("m","n","p","o","q"),c(2,1,1,4,4))
nq <- cbind(nn,qq)[c(4,2,9,11,6,10,7,3,5,1,12,8),]
## Here we consider 2 columns 'nn' and 'qq' whe trying to regroup common values
## (eg value 'a' from column 'nn' and value 'o' from 'qq')
combineByEitherFactor(nq, 1, 2, nBy=FALSE)
## nn qq grp
## m2 "a" "m" "1"
## q2 "f" "q" "3"
## q1 "d" "q" "3"
## o2 "b" "o" "2"
## p "e" "p" "4"
## o4 "c" "o" "2"
## q4 "g" "q" "3"
## n "a" "n" "1"
## m1 "a" "m" "1"
## q3 "g" "q" "3"
## o1 "b" "o" "2"
## o3 "c" "o" "2"
The argument nBy simply allows adding an additional column with the group/cluster-number.
## the same, but including n by group/cluster
combineByEitherFactor(nq, 1, 2, nBy=TRUE)
## nn qq grp nGrp
## m2 "a" "m" "1" "3"
## q2 "f" "q" "3" "4"
## q1 "d" "q" "3" "4"
## o2 "b" "o" "2" "4"
## p "e" "p" "4" "1"
## o4 "c" "o" "2" "4"
## q4 "g" "q" "3" "4"
## n "a" "n" "1" "3"
## m1 "a" "m" "1" "3"
## q3 "g" "q" "3" "4"
## o1 "b" "o" "2" "4"
## o3 "c" "o" "2" "4"
## Not running further iterations works faster, but you may not reach 'convergence' immediately
combineByEitherFactor(nq,1, 2, nBy=FALSE)
## nn qq grp
## m2 "a" "m" "1"
## q2 "f" "q" "3"
## q1 "d" "q" "3"
## o2 "b" "o" "2"
## p "e" "p" "4"
## o4 "c" "o" "2"
## q4 "g" "q" "3"
## n "a" "n" "1"
## m1 "a" "m" "1"
## q3 "g" "q" "3"
## o1 "b" "o" "2"
## o3 "c" "o" "2"
## another example
mm <- rep(c("a","b","c","d","e"), c(3,4,2,3,1))
pp <- rep(c("m","n","o","p","q"), c(2,2,2,2,5))
combineByEitherFactor(cbind(mm,pp), 1, 2, con=FALSE, nBy=TRUE)
## combineByEitherFactor : did not reach convergence at 2nd pass
## mm pp grp nGrp
## m1 "a" "m" "1" "4"
## m2 "a" "m" "1" "4"
## n1 "a" "n" "1" "4"
## n2 "b" "n" "1" "4"
## o1 "b" "o" "2" "4"
## o2 "b" "o" "2" "4"
## p1 "b" "p" "2" "4"
## p2 "c" "p" "2" "4"
## q1 "c" "q" "3" "5"
## q2 "d" "q" "3" "5"
## q3 "d" "q" "3" "5"
## q4 "d" "q" "3" "5"
## q5 "e" "q" "3" "5"
The function multiCharReplace()
facilitates multiple
replacements in a vector, matrix or data.frame.
# replace character content
x1 <- c("ab","bc","cd","efg","ghj")
multiCharReplace(x1, cbind(old=c("bc","efg"), new=c("BBCC","EF")))
## [1] "ab" "BBCC" "cd" "EF" "ghj"
# works also on matrix and/or to replace numeric content :
x3 <- matrix(11:16, ncol=2)
multiCharReplace(x3, cbind(12:13,112:113))
## [,1] [,2]
## [1,] 11 14
## [2,] 112 15
## [3,] 113 16
Sometimes data get imported using different encoding for what should be interpreted as FALSE and TRUE :
# replace and return logical vactor
x2 <- c("High","n/a","High","High","Low")
multiCharReplace(x2,cbind(old=c("n/a","Low","High"), new=c(NA,FALSE,TRUE)), convTo="logical")
## [1] TRUE NA TRUE TRUE FALSE
The function allows to split (if necessary, using strsplit()) two vectors and compare each isolated tag (eg identifyer) from the 1st vector/object against each isolated tag from the second vector/object. This runs like a loop of one to many comparisons. The basic output is a list with indexes of which element of the 1st vector/object has matches in the 2nd vector/object. Since this is not convenient to the human reader, tabular output can be created, too.
aa <- c("m","k","j; aa","m; aa; bb; o","n; dd","aa","cc")
bb <- c("aa","dd","aa; bb; q","p; cc")
## result as list of indexes
(bOnA <- multiMatch(aa, bb, method="asIndex")) # match bb on aa
## $`1`
## named integer(0)
##
## $`2`
## named integer(0)
##
## $`3`
## aa aa
## 1 3
##
## $`4`
## aa aa bb
## 1 3 3
##
## $`5`
## dd
## 2
##
## $`6`
## aa aa
## 1 3
##
## $`7`
## cc
## 4
## more convenient to the human reader
(bOnA <- multiMatch(aa, bb)) # match bb on aa
## x x.Ind TagBest y.IndBest y.IndAll y.Match y.Adj
## 1 m 1 <NA> NA <NA> <NA> <NA>
## 2 k 2 <NA> NA <NA> <NA> <NA>
## 3 j; aa 3 aa 1 1; 3 aa j; aa
## 4 m; aa; bb; o 4 aa 3 1; 3; 3 aa; bb; q m; aa; bb; o
## 5 n; dd 5 dd 2 2 dd n; dd
## 6 aa 6 aa 1 1; 3 aa aa
## 7 cc 7 cc 4 4 p; cc cc
(bOnA <- multiMatch(aa, bb, method="matchedL")) # match bb on aa
## $`1`
## named integer(0)
##
## $`2`
## named integer(0)
##
## $`3`
## aa aa
## 1 3
##
## $`4`
## aa aa bb
## 1 3 3
##
## $`5`
## dd
## 2
##
## $`6`
## aa aa
## 1 3
##
## $`7`
## cc
## 4
In most programming languages it is fairly easy to compare exact content of character vectors or factors with unordered levels. However, sometimes - due to semantic issues - some people may call a color ‘purple’ while others call it ‘violet’. Thus, without using controled vocabulary the exact terms may vary.
Here, let’s address the case, where no dictionaries of controled vocabulary are available for substituting equivalent terms. Thus, we’ll compare 4 vectors of equal length and check if the words/letters used could be substituted to result in the first vector. Vectors aa and ab have the same global pattern, ie after repeating a word twice it moves to another word. Vectors ac and ad have different general patterns, either with alternating words or falling back to a word previsously used.
Based and extended on a post on stackoverflow https://stackoverflow.com/questions/71353218/extracting-flexible-general-patterns/ :
aa <- letters[rep(c(3:1,4), each=2)]
ab <- letters[rep(c(5,8:6), each=2)] # 'same general' pattern to aa
ac <- letters[c(1:2,1:3,3:4,4)] # NOT 'same general' pattern to any other
ad <- letters[c(6:8,8:6,7:6)] # NOT 'same general' pattern to any other
The basic pattern can be extracted combining match() and unique():
## get global patterns
cbind(aa= match(aa, unique(aa)),
ab= match(ab, unique(ab)),
ac= match(ac, unique(ac)),
ad= match(ad, unique(ad)) )
## aa ab ac ad
## [1,] 1 1 1 1
## [2,] 1 1 2 2
## [3,] 2 2 1 3
## [4,] 2 2 2 3
## [5,] 3 3 3 2
## [6,] 3 3 3 1
## [7,] 4 4 4 2
## [8,] 4 4 4 1
Let’s make a data.frame with the annotation toy-data from above. Each line is supposed to represent a sample, and the columns show different aspects of annotation.
bb <- data.frame(ind=1:length(aa), a=aa, b=ab, c=ac, d=ad)
Via the function replicateStructure()
is it possible to
compare annotation as different columns for equivalent global
patterns.
By default, this function excludes all columns not designating any replicates, like the numbers in the first column ($ind). Also it will try to find the column with the median number of levels, when comparing to all other columns.
The output is a list with $col inidicating which column(s) may be used, $lev for the correpsonding global pattern, $meth for the method finally used and $allCols for documenting the global pattern in each column (whether it was selected or not).
replicateStructure(bb)
## $col
## a
## 2
##
## $lev
## c c b b a a d d
## 1 1 2 2 3 3 4 4
##
## $meth
## [1] "single median col"
Besides, it is also possible to combine all columns if one considers they contribute complementary substructures of the overal annotation.
replicateStructure(bb, method="combAll")
## $col
## a c d
## 2 4 5
##
## $lev
## c_a_f c_b_g b_a_h b_b_h a_c_g a_c_f d_d_g d_d_f
## 1 2 3 4 5 6 7 8
##
## $meth
## [1] "comb all col"
However, when combining multiple columns it may happen -like in the example above- that finally no more lines remain being considered as replicates.
This can also be found when one column describes the groups and another gives the order of the replicates therein. However, for calling a (standard) statistical test it may be necessary exclude these replicate-numbers to designate the groups of replicates.
To overcome the problem of loosing the understanding of replicate-structure when combining all factors, it is possible to look for non-orthogonal structures, ie to try excluding columns which (after combining) would suggest no replicates after combining all columns. See the example below :
replicateStructure(bb, method="combNonOrth")
## $col
## a d
## 2 5
##
## $lev
## c_f c_g b_h b_h a_g a_f d_g d_f
## 1 2 3 3 4 5 6 7
##
## $meth
## [1] "combNonOrth col"
This section addresses values that are not truly identical
but may differ only in the very last digit(s) and thus may be in a
pragmatic view get considered and treated as ‘about the same’. The
simplest approach would be to round values and then look for identical
values. The functions presented here (like
checkSimValueInSer()
) offer this type of search in a
convenient way.
Of course the user must define a threshold for how similar may retained as positive (in the the logical vector returned). With the function checkSimValueInSer() this threshod must be given as ppm (parts per million).
va1 <- c(4:7,7,7,7,7,8:10) + (1:11)/28600
checkSimValueInSer(va1, ppm=5)
## [1] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
data.frame(va=sort(va1), simil=checkSimValueInSer(va1))
## va simil
## 1 4.000035 FALSE
## 2 5.000070 FALSE
## 3 6.000105 FALSE
## 4 7.000140 TRUE
## 5 7.000175 TRUE
## 6 7.000210 TRUE
## 7 7.000245 TRUE
## 8 7.000280 TRUE
## 9 8.000315 FALSE
## 10 9.000350 FALSE
## 11 10.000385 FALSE
The search for similar values may be preformed as absolute distance or as ‘ppm’ (as it is eg usual in proteomics when comparing measured and theoretically expected mass).
aA <- c(11:17); bB <- c(12.001,13.999); cC <- c(16.2,8,9,12.5,15.9,13.5,15.7,14.1,5)
(cloMa <- findCloseMatch(x=aA, y=cC, com="diff", lim=0.5, sor=FALSE))
## $x2
## y4
## 0.5
##
## $x3
## y4 y6
## -0.5 0.5
##
## $x4
## y6 y8
## -0.5 0.1
##
## $x6
## y1 y5 y7
## 0.2 -0.1 -0.3
The result of findCloseMatch() is a list organized by each
‘x’, telling all instances of ‘y’ found within the distance tolerance
given by lim. Using closeMatchMatrix()
the result
obtained above, can be presented in a more convenient format for the
human eye.
# all matches (of 2d arg) to/within limit for each of 1st arg ('x'); 'y' ..to 2nd arg = cC
# first let's display only one single closest/best hit
(maAa <- closeMatchMatrix(cloMa, aA, cC, lim=TRUE)) #
## id.aA aA id.cC cC disToPred ppmToPred nByGrp isMin nBest
## [1,] 2 12 4 12.5 -0.5 -40000.0 1 1 1
## [2,] 3 13 4 12.5 0.5 40000.0 2 1 2
## [3,] 3 13 6 13.5 -0.5 -37037.0 2 1 2
## [4,] 4 14 8 14.1 -0.1 -7092.2 2 1 1
## [5,] 6 16 5 15.9 0.1 6289.3 3 1 1
Using the argument limitToBest=FALSE we can display all distances within the limits imposed, some values/points may occur multiple times. For example, value number 4 of ‘cC’ (=12.5) or value number 3 of ‘aA’ (=13) now occur multiple times…
(maAa <- closeMatchMatrix(cloMa, aA, cC, lim=FALSE,origN=TRUE)) #
## id.aA aA id.cC cC disToPred ppmToPred nByGrp isMin nBest
## [1,] 2 12 4 12.5 -0.5 -40000.0 1 1 1
## [2,] 3 13 4 12.5 0.5 40000.0 2 1 2
## [3,] 3 13 6 13.5 -0.5 -37037.0 2 1 2
## [4,] 4 14 6 13.5 0.5 37037.0 2 0 0
## [5,] 4 14 8 14.1 -0.1 -7092.2 2 1 1
## [6,] 6 16 7 15.7 0.3 19108.0 3 0 0
## [7,] 6 16 5 15.9 0.1 6289.3 3 1 1
## [8,] 6 16 1 16.2 -0.2 -12346.0 3 0 0
(maAa <- closeMatchMatrix(cloMa, cbind(valA=81:87, aA), cbind(valC=91:99, cC), colM=2,
colP=2, lim=FALSE))
## closeMatchMatrix : Reset argument 'origNa' to FALSE since names of 'predMatr' and/or 'measMatr' result of formula and would be too long
## id.pred valA aA id.meas valC cC disToPred ppmToPred nByGrp isMin nBest
## [1,] 2 82 12 4 94 12.5 -0.5 -40000.0 1 1 1
## [2,] 3 83 13 4 94 12.5 0.5 40000.0 2 1 2
## [3,] 3 83 13 6 96 13.5 -0.5 -37037.0 2 1 2
## [4,] 4 84 14 6 96 13.5 0.5 37037.0 2 0 0
## [5,] 4 84 14 8 98 14.1 -0.1 -7092.2 2 1 1
## [6,] 6 86 16 7 97 15.7 0.3 19108.0 3 0 0
## [7,] 6 86 16 5 95 15.9 0.1 6289.3 3 1 1
## [8,] 6 86 16 1 91 16.2 -0.2 -12346.0 3 0 0
(maAa <- closeMatchMatrix(cloMa, cbind(aA,valA=81:87), cC, lim=FALSE, deb=TRUE)) #
## closeMatchMatrix : .. xxidentToMatr2a
## closeMatchMatrix : .. xxidentToMatr2c
## closeMatchMatrix : Reset argument 'origNa' to FALSE since names of 'predMatr' and/or 'measMatr' result of formula and would be too long
## closeMatchMatrix : .. xxidentToMatr2d
## closeMatchMatrix : .. xxidentToMatr2e
## closeMatchMatrix : .. xxidentToMatr2f
## id.pred aA valA id.meas measMatr disToPred ppmToPred nByGrp isMin nBest
## [1,] 2 12 82 4 12.5 -0.5 -40000.0 1 1 1
## [2,] 3 13 83 4 12.5 0.5 40000.0 2 1 2
## [3,] 3 13 83 6 13.5 -0.5 -37037.0 2 1 2
## [4,] 4 14 84 6 13.5 0.5 37037.0 2 0 0
## [5,] 4 14 84 8 14.1 -0.1 -7092.2 2 1 1
## [6,] 6 16 86 7 15.7 0.3 19108.0 3 0 0
## [7,] 6 16 86 5 15.9 0.1 6289.3 3 1 1
## [8,] 6 16 86 1 16.2 -0.2 -12346.0 3 0 0
a2 <- aA; names(a2) <- letters[1:length(a2)]; c2 <- cC; names(c2) <- letters[10 +1:length(c2)]
(cloM2 <- findCloseMatch(x=a2, y=c2, com="diff", lim=0.5, sor=FALSE))
## $b
## n
## 0.5
##
## $c
## n p
## -0.5 0.5
##
## $d
## p r
## -0.5 0.1
##
## $f
## k o q
## 0.2 -0.1 -0.3
(maA2 <- closeMatchMatrix(cloM2, predM=cbind(valA=81:87, a2),
measM=cbind(valC=91:99, c2), colM=2, colP=2, lim=FALSE, asData=TRUE))
## closeMatchMatrix : Reset argument 'origNa' to FALSE since names of 'predMatr' and/or 'measMatr' result of formula and would be too long
## id.pred valA a2 id.meas valC c2 disToPred ppmToPred nByGrp isMin nBest
## b b 82 12 n 94 12.5 -0.5 -40000.0 1 1 1
## c_1 c 83 13 n 94 12.5 0.5 40000.0 2 1 2
## c_2 c 83 13 p 96 13.5 -0.5 -37037.0 2 1 2
## d_1 d 84 14 p 96 13.5 0.5 37037.0 2 0 0
## d_2 d 84 14 r 98 14.1 -0.1 -7092.2 2 1 1
## f_1 f 86 16 q 97 15.7 0.3 19108.0 3 0 0
## f_2 f 86 16 o 95 15.9 0.1 6289.3 3 1 1
## f_3 f 86 16 k 91 16.2 -0.2 -12346.0 3 0 0
(maA2 <- closeMatchMatrix(cloM2, cbind(id=names(a2), valA=81:87,a2), cbind(id=names(c2),
valC=91:99,c2), colM=3, colP=3, lim=FALSE, deb=FALSE))
## closeMatchMatrix : Reset argument 'origNa' to FALSE since names of 'predMatr' and/or 'measMatr' result of formula and would be too long
## id.pred valA a2 id.meas valC c2 disToPred ppmToPred nByGrp
## b "b" "82" "12" "n" "94" "12.5" "-0.5" "-40000" "1"
## c "c" "83" "13" "n" "94" "12.5" "0.5" "40000" "2"
## c "c" "83" "13" "p" "96" "13.5" "-0.5" "-37037" "2"
## d "d" "84" "14" "p" "96" "13.5" "0.5" "37037" "2"
## d "d" "84" "14" "r" "98" "14.1" "-0.0999999999999996" "-7092.2" "2"
## f "f" "86" "16" "q" "97" "15.7" "0.300000000000001" "19108" "3"
## f "f" "86" "16" "o" "95" "15.9" "0.0999999999999996" "6289.3" "3"
## f "f" "86" "16" "k" "91" "16.2" "-0.199999999999999" "-12346" "3"
## isMin nBest
## b "1" "1"
## c "1" "2"
## c "1" "2"
## d "0" "0"
## d "1" "1"
## f "0" "0"
## f "1" "1"
## f "0" "0"
For comparing two sets of data one may use
findSimilarFrom2sets()
.
aA <- c(11:17); bB <- c(12.001,13.999); cC <- c(16.2,8,9,12.5,12.6,15.9,14.1)
aZ <- matrix(c(aA,aA+20), ncol=2, dimnames=list(letters[1:length(aA)],c("aaA","aZ")))
cZ <- matrix(c(cC,cC+20), ncol=2, dimnames=list(letters[1:length(cC)],c("ccC","cZ")))
findCloseMatch(cC, aA, com="diff", lim=0.5, sor=FALSE)
## $x1
## y6
## -0.2
##
## $x4
## y2 y3
## -0.5 0.5
##
## $x5
## y3
## 0.4
##
## $x6
## y6
## 0.1
##
## $x7
## y4
## -0.1
findSimilFrom2sets(aA, cC)
## aA predMatr cC measMatr disToPred ppmToPred nByGrp isMin nBest
## [1,] 2 12 4 12.5 -0.5 -40000.0 1 1 1
## [2,] 3 13 4 12.5 0.5 40000.0 2 0 0
## [3,] 3 13 5 12.6 0.4 31746.0 2 1 1
## [4,] 4 14 7 14.1 -0.1 -7092.2 1 1 1
## [5,] 6 16 6 15.9 0.1 6289.3 2 1 1
## [6,] 6 16 1 16.2 -0.2 -12346.0 2 0 0
findSimilFrom2sets(cC, aA)
## cC predMatr aA measMatr disToPred ppmToPred nByGrp isMin nBest
## [1,] 1 16.2 6 16 0.2 12500.0 1 1 1
## [2,] 4 12.5 2 12 0.5 41667.0 2 1 2
## [3,] 4 12.5 3 13 -0.5 -38462.0 2 1 2
## [4,] 5 12.6 3 13 -0.4 -30769.0 1 1 1
## [5,] 6 15.9 6 16 -0.1 -6250.0 1 1 1
## [6,] 7 14.1 4 14 0.1 7142.9 1 1 1
findSimilFrom2sets(aA, cC, best=FALSE)
## aA predMatr cC measMatr disToPred ppmToPred nByGrp isMin nBest
## [1,] 2 12 4 12.5 -0.5 -40000.0 1 1 1
## [2,] 3 13 4 12.5 0.5 40000.0 2 0 0
## [3,] 3 13 5 12.6 0.4 31746.0 2 1 1
## [4,] 4 14 7 14.1 -0.1 -7092.2 1 1 1
## [5,] 6 16 6 15.9 0.1 6289.3 2 1 1
## [6,] 6 16 1 16.2 -0.2 -12346.0 2 0 0
findSimilFrom2sets(aA, cC, comp="ppm", lim=5e4)
## aA predMatr cC measMatr disToPred ppmToPred nByGrp isMin nBest
## [1,] 2 12 4 12.5 -0.5 -40000.0 1 1 1
## [2,] 3 13 4 12.5 0.5 40000.0 2 0 0
## [3,] 3 13 5 12.6 0.4 31746.0 2 1 1
## [4,] 4 14 7 14.1 -0.1 -7092.2 1 1 1
## [5,] 6 16 6 15.9 0.1 6289.3 2 1 1
## [6,] 6 16 1 16.2 -0.2 -12346.0 2 0 0
## [7,] 7 17 1 16.2 0.8 49383.0 1 1 1
findSimilFrom2sets(aA, cC, comp="ppm", lim=9e4, bestO=FALSE)
## aA predMatr cC measMatr disToPred ppmToPred nByGrp isMin nBest
## [1,] 2 12 4 12.5 -0.5 -40000.0 2 1 1
## [2,] 2 12 5 12.6 -0.6 -47619.0 3 0 0
## [3,] 3 13 4 12.5 0.5 40000.0 2 0 0
## [4,] 3 13 5 12.6 0.4 31746.0 3 1 1
## [5,] 3 13 7 14.1 -1.1 -78014.0 3 0 0
## [6,] 4 14 7 14.1 -0.1 -7092.2 1 1 1
## [7,] 5 15 7 14.1 0.9 63830.0 3 1 2
## [8,] 5 15 6 15.9 -0.9 -56604.0 3 1 2
## [9,] 5 15 1 16.2 -1.2 -74074.0 2 0 0
## [10,] 6 16 6 15.9 0.1 6289.3 3 0 0
## [11,] 6 16 1 16.2 -0.2 -12346.0 2 0 0
## [12,] 7 17 6 15.9 1.1 69182.0 2 1 0
## [13,] 7 17 1 16.2 0.8 49383.0 2 1 2
# below: find fewer 'best matches' since search window larger (ie more good hits compete !)
findSimilFrom2sets(aA, cC, comp="ppm", lim=9e4, bestO=TRUE)
## aA predMatr cC measMatr disToPred ppmToPred nByGrp isMin nBest
## [1,] 2 12 4 12.5 -0.5 -40000.0 2 1 1
## [2,] 3 13 5 12.6 0.4 31746.0 3 1 1
## [3,] 4 14 7 14.1 -0.1 -7092.2 1 1 1
## [4,] 5 15 7 14.1 0.9 63830.0 3 1 2
## [5,] 5 15 6 15.9 -0.9 -56604.0 3 1 2
## [6,] 7 17 6 15.9 1.1 69182.0 2 1 0
## [7,] 7 17 1 16.2 0.8 49383.0 2 1 2
When you have already identified the closest neighbour of a set of
values, you may want to re-organize/fuse such pairs to a given number of
total clusters (using fusePairs()
).
(daPa <- matrix(c(1:5,8,2:6,9), ncol=2))
## [,1] [,2]
## [1,] 1 2
## [2,] 2 3
## [3,] 3 4
## [4,] 4 5
## [5,] 5 6
## [6,] 8 9
fusePairs(daPa, maxFuse=4)
## 1 2 3 4 4 5 6 8 9
## 1 1 1 1 2 2 2 3 3
When visualizing larger data-sets in an x&y space one may find
many points overlapping when their values are almost the same.
The function elimCloseCoord()
aims to do reduce a bivariate
data-set to ‘non-overlapping’ points, somehow similar to human
perception.
da1 <- matrix(c(rep(0:4,5),0.01,1.1,2.04,3.07,4.5), ncol=2); da1[,1] <- da1[,1]*99; head(da1)
## [,1] [,2]
## [1,] 0 0
## [2,] 99 1
## [3,] 198 2
## [4,] 297 3
## [5,] 396 4
## [6,] 0 0
elimCloseCoord(da1)
## elimCloseCoord : reducing 'x' from 15 to 7 lines
## [,1] [,2]
## 1 0 0.0
## 2 99 1.0
## 3 198 2.0
## 4 297 3.0
## 5 396 4.0
## 12 99 1.1
## 15 396 4.5
Looking for the mode is rather easy with counting data, the
result of table() will get you there quickly. However, with
continuous data the mode may be more tricky to defne and identify.
Intuitively most people consider the mode asthe peak of a density
estimation (which remains to be defined and estimated). With continuous
data most frequent (precise) value may be quite different/distant to the
most dense region of data. The function stableMode()
presented here has different modes of operation, at this point there is
no clear rule which mode may perform most satisfactory in different
situations.
set.seed(2012); dat <- round(c(rnorm(120,0,1.2), rnorm(80,0.8,0.6), rnorm(25,-0.6,0.05), runif(200)),3)
dat <- dat[which(dat > -2 & dat <2)]
stableMode(dat)
## stableMode : Method='density', length of x =406, 'bandw' has been set to 28
## 221
## 0.477
Now we can try to show on a plot :
layout(1:2)
plot(1:length(dat), sort(dat), type="l", main="Sorted Values", xlab="rank", las=1)
abline(h=stableMode(dat, silent=TRUE), lty=2,col=2)
legend("topleft",c("stableMode"), text.col=2, col=2, lty=2, lwd=1, seg.len=1.2, cex=0.8, xjust=0, yjust=0.5)
plot(density(dat, kernel="gaussian", adjust=0.7), xlab="Value of dat", main="Density Estimate Plot")
useCol <- c("red","green","blue","grey55")
legend("topleft",c("dens","binning","BBmisc","allModes"), text.col=useCol, col=useCol,
lty=2, lwd=1, seg.len=1.2, cex=0.8, xjust=0, yjust=0.5)
abline(v=stableMode(dat, method="dens", silent=TRUE), lty=2, col="red", lwd=2)
abline(v=stableMode(dat, method="binning", silent=TRUE), lty=2, col="green")
abline(v=stableMode(dat, method="BBmisc", silent=TRUE), lty=2, col="blue")
## Loading required namespace: BBmisc
abline(v=stableMode(dat, method="allModes"), lty=2, col="grey55")
Please note, that plotting data modelled via a Kernell function (as above) also relies on strong hypothesis which may not be well justified in a number of cases ! For this reason, the sorted values were plotted, too.
As you can see from this example above, looking for the most frequent exact value may not be a perfect choice for continous data. In this example the method ‘allModes’ (ie the multiple instances of most frequent exact values) gave partially usable results (dashed grey lines), due to the rounding to 3 digits. As you can see in the example above, the method ‘allModes’ may give multiple ties ! More rounding will make to data more discrete and ultimately ressemble cunting data. However, with rounding some of the finer resolution/details will get lost.
The function stableMode()
can also be used to locate
the most frequently occuring exact value of numeric or
character vectors. As we just saw at the end of the previous example,
the argument method=“allModes” allows finding all ties (if
present).
set.seed(2021)
x <- sample(letters, 50000, replace=TRUE)
stableMode(dat, method="mode")
## [1] 0.173
stableMode(dat, method="allModes")
## [1] 0.173 0.629 0.676
There are several packages offering interesting functions for manipulating text. Here are a few functions to complement these.
The function protectSpecChar()
allows protecting the
majority of special characters which may influence the outcome of
grep() and related functions so thay they can be used easier in
searches using grep() or alike.
aa <- c("abc","abcde","ab.c","ab.c.e","ab*c","ab\\d")
grepl("b.", aa) # all TRUE
## [1] TRUE TRUE TRUE TRUE TRUE TRUE
grepl(protectSpecChar("b."), aa)
## [1] FALSE FALSE TRUE TRUE FALSE FALSE
Automatic annotation has the tendency to concatenate many parameters
into a single names. The function trimRedundText()
was
designed to allow trimming redundant text from left and/or right side of
a character-vector (when the same portion of text appears in
each element). However, as in some cases (like the first
element of the example below) nothing would remain, it is possible to
define a minimum width for the remaining/resulting text.
txt1 <- c("abcd","abcde","abcdefg","abcdE",NA,"abcdEF")
trimRedundText(txt1)
## [1] "d" "de" "defg" "dE" NA "dEF"
The original idea was to do something resembling the inverse process of trimming redundant text (example above), but this time to discard the variable text.
In the end this is not as trivial when ‘common’ or ‘redundant’ text
is not at the beginning or end of a chain of characters. In particular
with very large text this is an active field of research (eg for
sequence alignment). The function presented here is a very light-weight
solution designed for smaller and simple settings, like inspecting
column-names. Furthermore, the function keepCommonText()
only reports the first (longest) hit. So, when there are multiple
conserved ‘words’ of equal length, only the first of them will be
identified.
When setting the argument ‘hiResol=FALSE’ this function has an option to decrease the resulution of searching, which in turn increases the speed, howevere, at cost of missing the optimal solution. In this case the resultant chain of characters should be inspected if it can be further extended/optimized.
With terminal common text :
txt1 <- c("abcd","abcde","abcdefg","abcdE",NA,"abcdEF")
trimRedundText(txt1, side="left") # remove redundant
## [1] "d" "de" "defg" "dE" NA "dEF"
keepCommonText(txt1, side="terminal") # keep redundant
## [1] "abc"
keepCommonText(txt1, side="center") # computationally easier
## [1] "ab"
With internal coomon text:
txt2 <- c("abcd_abc_kjh", "bcd_abc123", "cd_abc_po")
keepCommonText(txt2, side="center")
## [1] "cd_abc"
Human operators may have many ways to write enumerators like ‘xx_sample_1’, ‘xx_Sample_2’, ‘xx_s3’, ‘xx_4’, etc. Many times you may find such text as names or column-names for measures underneith.
The functions presented below will work only if consistent numerators, ie (text +) digit-character(s) are at the end of all character-strings to be treated.
Please note, that with large vectors testing/checking a larger panel of enumerator-abreviations may result in slower performance. In cases of such larger data-sets it may be more effective to first study the data and then run simple subsitions using sub() targeted for this very case.
The aim of this function consists in identifying a common pattern for terminal enumeratos (ie at end of words/character strings) and to subsequently modify or remove them. As separator-symbols and separator-words are given indedently all combinations thereof may be tested. Furthermore the user has the choice to (automatically) all truncated versions of separator-words (eg Sam instead of Sample).
As basic setting rmEnumeratorName()
allows to identify
and then modify a common terminal enumerator from all elements
of a character string :
xx <- c("hg_Re1","hjRe2_Re2","hk-Re3_Re33")
rmEnumeratorName(xx)
## [1] "hg1" "hjRe22" "hk-Re333"
rmEnumeratorName(xx, newSep="--")
## [1] "hg--1" "hjRe2--2" "hk-Re3--33"
rmEnumeratorName(xx, incl="anyCase")
## rmEnumeratorName : No conistent enumerator+digit combination found; nothing to do ..
## [1] "hg_Re1" "hjRe2_Re2" "hk-Re3_Re33"
Furthermore, this function allows scanning a matrix of text-data and to perform similar operations to the first column found containing a common terminal enumerator.
xy <- cbind(a=11:13, b=c("11#11","2_No2","333_samp333"), c=xx)
rmEnumeratorName(xy)
## a b c
## [1,] "11" "11#11" "hg1"
## [2,] "12" "2_No2" "hjRe22"
## [3,] "13" "333_samp333" "hk-Re333"
rmEnumeratorName(xy,incl=c("anyCase","trim2","rmEnumL"))
## $dat
## a b c
## [1,] "11" "11#11" "hg"
## [2,] "12" "2_No2" "hjRe2"
## [3,] "13" "333_samp333" "hk-Re3"
##
## $column
## [1] 3
##
## $pattern
## [1] "_Re"
If you which to remove/subsitute mutiple types of enumerators the function must be run independently, see last example below.
xz <- cbind(a=11:13, b=c("23#11","4#2","567#333"), c=xx)
apply(xz, 2, rmEnumeratorName, sepEnum=c("","_"), newSep="_", silent=TRUE)
## a b c
## [1,] "11" "23_11" "hg_1"
## [2,] "12" "4_2" "hjRe2_2"
## [3,] "13" "567_333" "hk-Re3_33"
The (slightly older) function unifyEnumerator()
offers
less options, in particular the potential separator-words must be given
explicitly, only lower/upper-case may be kept flexible.
unifyEnumerator(c("ab-1","ab-2","c-3"))
## [1] "ab-" "ab-" "c-"
unifyEnumerator(c("ab-R1","ab-R2","c-R3"))
## [1] "ab-R" "ab-R" "c-R"
unifyEnumerator(c("ab-1","c3-2","dR3"), stringentMatch=FALSE)
## [1] "ab-" "c3-" "dR"
The function checkUnitPrefix
aims to find a
unit-abbreviation or -name occurring in all elements of a
character-vector. The unit name may be preceeded by different decimal
prefixes (eg ‘k’,‘M’, see argument pref) which may vary within
the vector. By default some common SI-units will be searched, see
argument unit.
x1 <- c("10fg WW","xx 10fg 3pW"," 1pg 2.0W")
checkUnitPrefix(x1)
## [1] "g"
## different separators between digit and prefix:
x2 <- c("10fg WW","xx 8_fg 3pW"," 1 pg-2.0W")
checkUnitPrefix(x2, stringentSearch=TRUE)
## NULL
checkUnitPrefix(x2, stringentSearch=FALSE)
## [1] "g"
The function adjustUnitPrefix()
provides help extracting
the numeric part of character vectors and allows adjusting to a single
million-unit type. This can be used to convert a vector of mixed
prefixes like ‘z’,‘a’,‘f’,‘p’,‘n’,‘u’ and ‘m’ (note: the ‘u’ is used for
‘micro’). The output is a character vector with all prefix+unit
expressions adjusted to use the same prefix everywhere
(numeric+separator+prefix+unit). Redundant additional text may get
(optionally) trimmed (see argument returnType=“trim”), the
numeric part names + unit is give in the names.
Please note that decimal/comma digits will not be recognized properly, the function may/will consider the decimal sign as just another separator (as ‘.’ is part af the default selection of separators).
adjustUnitPrefix(c("10.psec", "2 fsec"), unit="sec")
## 10000 2
## "10000 fsec" "2 fsec"
Using the argument returnType you can choose how much of the remaining text should be shown.
x2 <- c("abCc 500_nmol ABC", "abEe5_umol", "", "abFF_100_nmol_G", "abGg 2_mol", "abH.1 mmol")
rbind( adjustUnitPrefix(x2, unit="mol", returnType="allText") ,
adjustUnitPrefix(x2, unit="mol", returnType="trim"),
adjustUnitPrefix(x2, unit="mol", returnType=""))
## adjustUnitPrefix : Ignoring 1 entries not containing 'mol'
## adjustUnitPrefix : Ignoring 1 entries not containing 'mol'
## adjustUnitPrefix : Ignoring 1 entries not containing 'mol'
## 500 5000 <NA> 100
## [1,] "abCc _500_nmol_ ABC" "abEe_5000_nmol_" NA "abFF__100_nmol__G"
## [2,] "Cc _500_nmol_" "Ee_5000_nmol_" NA "FF__100_nmol_"
## [3,] "500_nmol" "5000_nmol" NA "100_nmol"
## 2e+09 1e+06
## [1,] "abGg _2e+09_nmol_" "abH._1e+06_nmol_"
## [2,] "Gg _2e+09_nmol_" "H._1e+06_nmol_"
## [3,] "2e+09_nmol" "1e+06_nmol"
If the unit -name is not knonw in advance, you can try to figure out. In the example it is shown that the unit-name can be determined using the function checkUnitPrefix().
x3 <- c("2.psec abc","300 fsec etc", "34 5fsec")
adjustUnitPrefix(x3, unit=checkUnitPrefix(x3))
## 2000 300 5
## "2000fsec abc" "300fsec etc" "34 5fsec"
The function mergeVectors()
allows merging for multiple
named vectors (each element needs to be named). Basically, all elements
carrying the same name across different input-vectors will be aligned in
the same column of the output (input-vectors appear as lines). Different
to merge() which allows merging only 2 data.frames, here
multiple vectors may be merge at once.
x1 <- c(a=1, b=11, c=21)
x2 <- c(b=12, c=22, a=2)
x3 <- c(a=3, d=43)
mergeVectors(vect1=x1, vect2=x2, vect3=x3)
## a b c d
## vect1 1 11 21 NA
## vect2 2 12 22 NA
## vect3 3 NA NA 43
mergeVectors(vect1=x1, vect2=x2, vect3=x3, inclInfo=TRUE) # return list with additional info
## mergeVectors : Vectors must be longer than 0 and must have names on each element for merging; omit 1 (out of 4) vector(s)
## a b c d
## vect1 1 11 21 NA
## vect2 2 12 22 NA
## vect3 3 NA NA 43
In the example below we’ll add another vector without named elements. As you can see a message tells the this vector was been ignored/omitted.
x4 <- 41:44 # no names - not conform for merging and will be ignored
mergeVectors(x1, x2, x3, x4)
## mergeVectors : Vectors must be longer than 0 and must have names on each element for merging; omit 1 (out of 4) vector(s)
## a b c d
## x.1 1 11 21 NA
## x.2 2 12 22 NA
## x.3 3 NA NA 43
This function allows adjusting the order of lines of a matrix to a reference character-vector , even when initial direct matching of character-strings using is not possible/successful. In this case, various variants of using will be used to see if unambiguous matching is possible of characteristic parts of the text. All columns of will be tested an the column giving the best results will be used.
## Note : columns b and e allow non-ambigous match, not all elements of e are present in a
mat0 <- cbind(a=c("mvvk","axxd","bxxd","vv"),b=c("iwwy","iyyu","kvvh","gxx"), c=rep(9,4),
d=c("hgf","hgf","vxc","nvnn"), e=c("_vv_","_ww_","_xx_","_yy_"))
matchMatrixLinesToRef(mat0[,1:4], ref=mat0[,5])
## aOO1
## ref
## [1,] "_vv_"
## [2,] "_ww_"
## [3,] "_xx_"
## [4,] "_yy_"
matchMatrixLinesToRef(mat0[,1:4], ref=mat0[1:3,5], inclInfo=TRUE)
## aOO1
## $mat
## ref
## [1,] "_vv_"
## [2,] "_ww_"
## [3,] "_xx_"
##
## $byColumn
## [1] 2
##
## $newOrder
## vv ww xx
## 3 1 4
##
## $method
## [1] "grep of ref after trimming redundant text"
matchMatrixLinesToRef(mat0[,-2], ref=mat0[,2], inclInfo=TRUE) # needs 'reverse grep'
## matchMatrixLinesToRef : rmEnumeratorName : Invalid or empty input; nothing to do ..
## matchMatrixLinesToRef : rmEnumeratorName : Invalid or empty input; nothing to do ..
## matchMatrixLinesToRef : rmEnumeratorName : Invalid or empty input; nothing to do ..
## $mat
## a c d e ref
## [1,] "axxd" "9" "hgf" "_ww_" "iwwy"
## [2,] "vv" "9" "nvnn" "_yy_" "iyyu"
## [3,] "mvvk" "9" "hgf" "_vv_" "kvvh"
## [4,] "bxxd" "9" "vxc" "_xx_" "gxx"
##
## $byColumn
## [1] 4
##
## $newOrder
## [1] 2 4 1 3
##
## $method
## [1] "Reverse grep after trimming redundant text"
The function orderMatrToRef()
has the aim of
facilitating brining a matrix of text/data in the order of a given
reference (character vector). This function will try all columns of the
input-matrix to see which gives the best coverage/ highest number of
matches to the reference. If no hits are found, this function will try
by partial matching (using grep()) all entries of the reference
and vice-versa all entries of the matrix.
mat1 <- matrix(paste0("__",letters[rep(c(1,1,2,2,3),3) +rep(0:2,each=5)], rep(1:5)), ncol=3)
orderMatrToRef(mat1, paste0(letters[c(3,4,5,3,4)],c(1,3,5,2,4)))
## $by
## [1] "mat"
##
## $colNo
## [1] 3
##
## $le
## c1 d3 e5 c2 d4
## 1 1 1 1 1
##
## $ord
## [1] 1 4 2 5 3
##
## $mat
## ref
## [1,] "__a1" "__b1" "__c1" "c1"
## [2,] "__b4" "__c4" "__d4" "d3"
## [3,] "__a2" "__b2" "__c2" "e5"
## [4,] "__c5" "__d5" "__e5" "c2"
## [5,] "__b3" "__c3" "__d3" "d4"
mat2 <- matrix(paste0("__",letters[rep(c(1,1,2,2,3),3) +rep(0:2,each=5)], c(rep(1:5,2),1,1,3:5 )), ncol=3)
orderMatrToRef(mat2, paste0(letters[c(3,4,5,3,4)],c(1,3,5,1,4)))
## $by
## [1] "ref"
##
## $colNo
## [1] 3
##
## $le
## c1 d3 e5 c1 d4
## 2 1 1 2 1
##
## $ord
## c1 d3 e5 c1 d4
## 1 3 5 2 4
##
## $mat
## ref
## [1,] "__a1" "__b1" "__c1" "c1"
## [2,] "__b3" "__c3" "__d3" "d3"
## [3,] "__c5" "__d5" "__e5" "e5"
## [4,] "__a2" "__b2" "__c1" "c1"
## [5,] "__b4" "__c4" "__d4" "d4"
mat3 <- matrix(paste0(letters[rep(c(1,1,2,2,3),3) +rep(0:2,each=5)], c(rep(1:5,2),1,1,3,3,5 )), ncol=3)
orderMatrToRef(mat3, paste0("__",letters[c(3,4,5,3,4)],c(1,3,5,1,3)))
## $by
## [1] "mat"
##
## $colNo
## [1] 3
##
## $le
## a1 a2 b3 b4 c5
## 2 2 2 2 1
##
## $ord
## [1] 1 3 5 2 4
##
## $mat
## ref
## [1,] "a1" "b1" "c1" "__c1"
## [2,] "b3" "c3" "d3" "__d3"
## [3,] "c5" "d5" "e5" "__e5"
## [4,] "a2" "b2" "c1" "__c1"
## [5,] "b4" "c4" "d3" "__d3"
Sometimes we need to match terms in concatenated tables. The function
concatMatch()
was designed to behave similar to
match() but also allowing to serach among concatenated terms
and some further text-simplifications.
## simple example without concatenations or text-extensions
x0 <- c("ZZ","YY","AA","BB","DD","CC","D")
tab0 <- c("AA","BB,E","CC","FF,U")
match(x0, tab0)
## [1] NA NA 1 NA NA 3 NA
concatMatch(x0, tab0) # same result as match(), but with names
## ZZ YY AA BB DD CC D
## NA NA 1 2 NA 3 NA
## now let's construct somthing similar but with concatenations and text-extensions
x1 <- c("ZZ","YY","AA","BB-2","DD","CCdef","Dxy") # modif of single ID (no concat)
tab1 <- c("AA","WW,Vde,BB-5,E","CCab","FF,Uef")
match(x1, tab1) # match finds only the 'simplest' case (ie "AA")
## [1] NA NA 1 NA NA NA NA
concatMatch(x1, tab1) # finds all hits as in example above
## ZZ YY AA BB DD CC D
## NA NA 1 2 NA 3 NA
x2 <- c("ZZ,Z","YY,Y","AA,Z,Y","BB-2","DD","X,CCdef","Dxy") # conatenated in 'x'
tab2 <- c("AA","WW,Vde,BB-5,E","CCab,WW","FF,UU")
concatMatch(x2, tab2) # concatenation in both 'x' and 'table'
## ZZ,Z YY,Y AA,Z,Y BB DD X,CC D
## NA NA 1 2 NA 3 NA
Thi function checkStrictOrder()
was designed to scan
each line of an (numeric) input matrix for up- down- or
equal-development, ie the chang to the next value on the right. For
example when working with a matrix of with 4 columns one can look 3
times a the neighbour value following to the right (in the same line),
thus the output will mention 3 events (for each line). If all
counts are ‘up’ and 0 counts are ‘down’ or ‘eq’, the line follows a
permanently increase (not necessarily linear), etc.
In some automated procedures (where the numer of columns of initial input may vary) it may be easier to test if any 0 occur. For this reason the argument invertCount was introduced, in this case a line with a ‘0’ occurring characterizes a constant behaviour (for the respective column).
set.seed(2005); mat1 <- rbind(matrix(round(runif(40),1),nc=4), rep(1,4))
head(mat1)
## [,1] [,2] [,3] [,4]
## [1,] 0.8 0.5 0.2 0.0
## [2,] 0.1 0.6 0.9 1.0
## [3,] 0.9 0.4 0.5 0.8
## [4,] 0.1 0.0 0.2 0.5
## [5,] 0.7 0.6 0.4 0.7
## [6,] 0.7 0.8 0.3 0.7
checkStrictOrder(mat1); mat1[which(checkStrictOrder(mat1)[,2]==0),]
## up down eq
## [1,] 0 3 0
## [2,] 3 0 0
## [3,] 2 1 0
## [4,] 2 1 0
## [5,] 1 2 0
## [6,] 2 1 0
## [7,] 1 2 0
## [8,] 2 1 0
## [9,] 1 2 0
## [10,] 2 1 0
## [11,] 0 0 3
## [,1] [,2] [,3] [,4]
## [1,] 0.1 0.6 0.9 1
## [2,] 1.0 1.0 1.0 1
A slightly more general way of testing can be done using
checkGrpOrder()
. Here, simlpy a logical value will produced
for each line of input indicating if there is constant behaviour. When
the argument revRank=TRUE (default) constant up- or constant
down-characteristics will be tested
head(mat1)
## [,1] [,2] [,3] [,4]
## [1,] 0.8 0.5 0.2 0.0
## [2,] 0.1 0.6 0.9 1.0
## [3,] 0.9 0.4 0.5 0.8
## [4,] 0.1 0.0 0.2 0.5
## [5,] 0.7 0.6 0.4 0.7
## [6,] 0.7 0.8 0.3 0.7
checkGrpOrder(mat1)
## [1] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
checkGrpOrder(mat1, revRank=FALSE) # only constant 'up' tested
## [1] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
In many types of measurments the very low level measures are
delicate. Especially when the readout starts with a baseline signal
before increasing amounts of the analyte start producing a linear
relationship. In such cases some of the very lowest levels of the
analyte are masked by the (random) baseline signal. The function
linModelSelect()
presented here allows omitting some of the
lowest analyte measures to focus on the linear part of the dose-response
relationship.
li1 <- rep(c(4,3,3:6), each=3) + round(runif(18)/5,2)
names(li1) <- paste0(rep(letters[1:5], each=3), rep(1:3,6))
li2 <- rep(c(6,3:7), each=3) + round(runif(18)/5, 2)
dat2 <- rbind(P1=li1, P2=li2)
exp2 <- rep(c(11:16), each=3)
exp4 <- rep(c(3,10,30,100,300,1000), each=3)
## Check & plot for linear model
linModelSelect("P1", dat2, expect=exp2)
## linModelSelect : best slope pVal starting at level no 3
## $coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.952667 0.22137502 -44.95840 7.131292e-13
## conc 1.003000 0.01522206 65.89121 1.578505e-14
##
## $name
## [1] "P1"
##
## $startLev
## [1] 3
linModelSelect("P2", dat2, expect=exp2)
## linModelSelect : best slope pVal starting at level no 2
## $coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.012667 0.17332584 -51.99840 1.806130e-16
## conc 1.008000 0.01231773 81.83325 5.058935e-19
##
## $name
## [1] "P2"
##
## $startLev
## [1] 2
This function was designed for use with rather small data-sets with no (or very few) measures of base-line. When larger panels of data ara available, it may be better to first define a confidence interval for the base-line measurement and then only to consider points outside this confidence interval for regressing dose-response relationships (see also Detection limit).
Once we have run multiple linear regressions on differt parts of the data we might wat to compare them in a single plot. Below, we construct 10 series of data that get modeled the same way, ideally one would obtain a slope close to 1.0. We still allow omitting some starting points, if the resulting model would fit better.
set.seed(2020)
x1 <- matrix(rep(c(2,2:5),each=20) + runif(100) +rep(c(0,0.5,2:3,5),20),
byrow=FALSE, ncol=10, dimnames=list(LETTERS[1:10],NULL))
## just the 1st regression :
summary(lm(b~a, data=data.frame(b=x1[,1], a=rep(1:5,each=2))))
##
## Call:
## lm(formula = b ~ a, data = data.frame(b = x1[, 1], a = rep(1:5,
## each = 2)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.3811 -0.6719 -0.5001 1.3683 2.6876
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.7850 1.3668 2.038 0.0759 .
## a 0.5545 0.4121 1.346 0.2153
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.843 on 8 degrees of freedom
## Multiple R-squared: 0.1846, Adjusted R-squared: 0.08263
## F-statistic: 1.811 on 1 and 8 DF, p-value: 0.2153
## all regressions
x1.lmSum <- t(sapply(lapply(rownames(x1), linModelSelect, dat=x1,
expect=rep(1:5,each=2), silent=TRUE, plotGraph=FALSE),
function(x) c(x$coef[2,c(4,1)], startFr=x$startLev)))
x1.lmSum <- cbind(x1.lmSum, medQuantity=apply(x1,1,median))
x1.lmSum[,1] <- log10(x1.lmSum[,1])
head(x1.lmSum)
## Pr(>|t|) Estimate startFr medQuantity
## A -4.298797 0.7837628 1 3.781966
## B -4.828403 1.1542815 2 3.756802
## C -5.873269 0.6638477 1 5.883383
## D -6.518075 0.7793624 1 6.703049
## E -5.288792 0.8599269 1 8.725195
## F -5.031322 0.9901120 2 3.286851
Now we can try to plot :
wrGraphOK <- requireNamespace("wrGraph", quietly=TRUE) # check if package is available
if(wrGraphOK) wrGraph::plotW2Leg(x1.lmSum, useCol=c("Pr(>|t|)","Estimate","medQuantity","startFr"),
legendloc="topleft", txtLegend="start at")
ratioAllComb()
calculates all possible pairwise ratios
between all individual calues of x and y.
set.seed(2014); ra1 <- c(rnorm(9,2,1), runif(8,1,2))
Let’s assume there are 2 parts of ‘x’ for which we would like to know the representative ratio : The ratio of medians does not well reflect the typical ratio (if each element has the same chance to be picked).
median(ra1[1:9]) / median(ra1[10:17])
## [1] 1.327086
Instead, we’ll build all possible ratios and summarize then.
summary( ratioAllComb(ra1[1:9], ra1[10:17]))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.359 1.142 1.274 1.290 1.506 2.777
boxplot(list(norm=ra1[1:9], unif=ra1[10:17], rat=ratioAllComb(ra1[1:9],ra1[10:17])))
The main idea of this function is to count frequency of terms when combining different drawings. Suppose, you are asking students for their prefered hobbies. Now, you want to know how many terms will occur in common in groups of 3 students. In the example below, simple letters are shown instead of names of hobbies …
In the simplest way of using combineAsN()
does something
similar to table : Here we’re looking at the full combinatorics
of making groups of nCombin students and let’s count the
frequency of terms found 3 times identical, 2 times or only once (ie not
cited by the others). In case multiple groups of nCombin
students can be formed, the average of the counts, standard error of the
mean (sem), 95% confidence interval (CI) and sd aregiven to resume the
results.
tm1 <- list(a1=LETTERS[1:7], a2=LETTERS[3:9], a3=LETTERS[6:10], a4=LETTERS[8:12])
combineAsN(tm1, nCombin=3, lev=gl(1,4))[,1,]
## n sem CI sd
## sing 5 0.8164966 2.598457 1.632993
## doub 5 0.8164966 2.598457 1.632993
## trip 1 0.5773503 1.837386 1.154701
## min2 10 1.1547005 3.674772 2.309401
## any 11 0.5773503 1.837386 1.154701
One may imagine that different locations/coties/countries will give different results. Thus, we’ll declare the different origins/location using the lev argument. Now, this function focusses (by default) on combinations of students from nCombin different origins/location and counts how many hobbies were mentioned as all different (‘sing’, ie number of hobbies only one student mentioned), single repeat (‘doub’) or three times repeated (‘trip’), plus minumum twice or ‘any’ (ie number of hobies citied no matter how many repeats). The output is an array, the 3rd dimension contains the counts, fllowed by sem, CI and sd.
## different levels/groups in list-elements
tm4 <- list(a1=LETTERS[1:15], a2=LETTERS[3:16], a3=LETTERS[6:17], a4=LETTERS[8:19],
b1=LETTERS[5:19], b2=LETTERS[7:20], b3=LETTERS[11:24], b4=LETTERS[13:25], c1=LETTERS[17:26],
d1=LETTERS[4:12], d2=LETTERS[5:11], d3=LETTERS[6:12], e1=LETTERS[7:10])
te4 <- combineAsN(tm4, nCombin=4, lev=substr(names(tm4),1,1))
## combineAsN : argument 'nCombin' combined to 'remDouble'=TRUE is too high for given data, re-setting to 4
str(te4)
## num [1:5, 1:7, 1:4] 4 4 3 8 19 ...
## - attr(*, "dimnames")=List of 3
## ..$ : chr [1:5] "sing" "doub" "trip" "min2" ...
## ..$ : chr [1:7] "a_a_a_a" "a_b_c_d" "a_b_c_e" "a_b_d_e" ...
## ..$ : chr [1:4] "n" "sem" "CI" "sd"
te4[,,1] # the counts part only
## a_a_a_a a_b_c_d a_b_c_e a_b_d_e a_c_d_e b_b_b_b b_c_d_e
## sing 4 6.562500 7.5000 7.437500 15.333333 3 10.333333
## doub 4 12.583333 12.5625 6.708333 4.166667 8 9.666667
## trip 3 4.395833 2.8750 3.520833 3.750000 3 2.000000
## min2 8 19.145833 20.0625 14.145833 19.500000 11 20.000000
## any 19 23.541667 22.9375 19.541667 23.250000 21 22.000000
Some software do produce a series of csv files, where a large
experiment/data-set get recorded as multiple files. The function
readCsvBatch()
was designed for reading multiple csv files
of exactly the same layout and to join their content. As output a list
with the content of each file can be produced (one matrix per file), or
the data may be fused into an array, as shown below.
path1 <- system.file("extdata", package="wrMisc")
fiNa <- c("pl01_1.csv","pl01_2.csv","pl02_1.csv","pl02_2.csv")
datAll <- readCsvBatch(fiNa, path1, silent=TRUE)
str(datAll)
## num [1:96, 1:4, 1] 158808 174272 183176 175752 49272 ...
## - attr(*, "dimnames")=List of 3
## ..$ : chr [1:96] "A01" "B01" "C01" "D01" ...
## ..$ : chr [1:4] "1_1.csv" "1_2.csv" "2_1.csv" "2_2.csv"
## ..$ : chr "StainA"
When setting the first argument fileNames to NULL, you can read all files of a given path.
## batch reading of all csv files in specified path :
datAll2 <- readCsvBatch(fileNames=NULL, path=path1, silent=TRUE)
str(datAll2)
## num [1:96, 1:4, 1] 158808 174272 183176 175752 49272 ...
## - attr(*, "dimnames")=List of 3
## ..$ : chr [1:96] "A01" "B01" "C01" "D01" ...
## ..$ : chr [1:4] "1_1.csv" "1_2.csv" "2_1.csv" "2_2.csv"
## ..$ : chr "StainA"
The function readTabulatedBatch()
allows fast batch
reading of tabulated files. All files specified (or all files from a
given directory) will be read into separate data.frames of a list.
Default options are US-style comma, automatic testing for head in case
the package data.table is available (otheriwse : no header).
Furthermore it is possible to design a given (numeric) column and
directly filter for all lines passing a given threshold, allowing to get
smaller objects.
path1 <- system.file("extdata", package="wrMisc")
fiNa <- c("a1.txt","a2.txt")
allTxt <- readTabulatedBatch(fiNa, path1)
str(allTxt)
## List of 2
## $ a1.txt:'data.frame': 33 obs. of 3 variables:
## ..$ V1: int [1:33] 3697 3626 732 388503 10747 1564 3699 256394 345 3950 ...
## ..$ V2: num [1:33] 4 6.24 6.63 6.71 8 ...
## ..$ V3: num [1:33] 0.621 0.507 0.575 0.502 0.525 ...
## $ a2.txt:'data.frame': 35 obs. of 3 variables:
## ..$ V1: int [1:35] 6414 57381 8404 10580 79611 4739 10252 221395 4256 4811 ...
## ..$ V2: num [1:35] 1.73 5.83 6.71 7.48 9.49 ...
## ..$ V3: num [1:35] 0.412 0.407 0.391 0.368 0.348 ...
Sometimes were may get confronted with data which look like
‘incomplete’ tables. In such cases some rows do not contain as many
elements/columns as other columns. Files with this type of data may pose
a problem for read.table()
(from the utils
package). In some cases using the argument fill=TRUE may allow
to overcome this problem. The function readVarColumns() (from
this package) was designed to provide better help in such odd cases.
Basically, each line is read and parsed separately, the user should
check/decide on the separator to be used.
The example below lists people’s names in different locations, some
locations have more persons … Sometimes exporting such data will
generate shorter lines in locations with fewer elements (here ‘London’)
and no additional separators will get added (to mark all empty fields)
towards the end. The function readVarColumns()
(from this
package) provides help to read such data, if the content (and
separators) of the last columns are missing.
path1 <- system.file("extdata", package="wrMisc")
fiNa <- "Names1.tsv"
datAll <- readVarColumns(fiName=file.path(path1,fiNa), sep="\t")
## readVarColumns : Setting 'refCo' to 'Location'
str(datAll)
## chr [1:2, 1:5] "Paris" "London" "Caroline" "James" "Marie" "Stella" ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:2] "Paris" "London"
## ..$ : chr [1:5] "Location" "Names" "Names_2" "Names_3" ...
In this example readVarColumns() would give a warning (and column-names are not recognized), if you use the argument header=TRUE you’ll get an error and nothing gets read.
GitHub allows sharing code and (to
a lower degree) data. In order to properly read tabulated (txt, tsv or
csv) data directly from a given url, the user should switch to the ‘Raw’
view. The function gitDataUrl()
allows to conventiently
switch any url (on git) to the format from ‘Raw view’, suitable for
directly reading the data using read.delim() ,
read.table() or read.csv() etc …).
## An example url with tabulated data :
url1 <- "https://github.com/bigbio/proteomics-metadata-standard/blob/master/annotated-projects/PXD001819/PXD001819.sdrf.tsv"
gitDataUrl(url1)
## [1] "https://raw.githubusercontent.com/bigbio/proteomics-metadata-standard/master/annotated-projects/PXD001819/PXD001819.sdrf.tsv"
The example below shows how this is used in the function readSampleMetaData() in wrProteo.
dataPxd <- try(read.delim(gitDataUrl(url1), sep='\t', header=TRUE))
str(dataPxd)
## 'data.frame': 27 obs. of 24 variables:
## $ source.name : chr "Sample 1" "Sample 1" "Sample 1" "Sample 2" ...
## $ characteristics.organism. : chr "Saccharomyces cerevisiae" "Saccharomyces cerevisiae" "Saccharomyces cerevisiae" "Saccharomyces cerevisiae" ...
## $ characteristics.organism.part. : chr "not available" "not available" "not available" "not available" ...
## $ characteristics.disease. : chr "not available" "not available" "not available" "not available" ...
## $ characteristics.cell.type. : chr "not applicable" "not applicable" "not applicable" "not applicable" ...
## $ characteristics.mass. : chr "2 mg" "2 mg" "2 mg" "2 mg" ...
## $ characteristics.spiked.compound. : chr "CT=mixture;QY=12500 amol;CN=UPS1;CV=Standards Research Group" "CT=mixture;QY=12500 amol;CN=UPS1;CV=Standards Research Group" "CT=mixture;QY=12500 amol;CN=UPS1;CV=Standards Research Group" "CT=mixture;QY=125 amol;CN=UPS1;CV=Standards Research Group" ...
## $ characteristics.biological.replicate.: int 1 1 1 1 1 1 1 1 1 1 ...
## $ material.type : chr "lysate" "lysate" "lysate" "lysate" ...
## $ assay.name : chr "run 1" "run 2" "run 3" "run 4" ...
## $ technology.type : chr "proteomic profiling by mass spectrometry" "proteomic profiling by mass spectrometry" "proteomic profiling by mass spectrometry" "proteomic profiling by mass spectrometry" ...
## $ comment.label. : chr "AC=MS:1002038;NT=label free sample" "AC=MS:1002038;NT=label free sample" "AC=MS:1002038;NT=label free sample" "AC=MS:1002038;NT=label free sample" ...
## $ comment.instrument. : chr "AC=MS:1001742;NT=LTQ Orbitrap Velos" "AC=MS:1001742;NT=LTQ Orbitrap Velos" "AC=MS:1001742;NT=LTQ Orbitrap Velos" "AC=MS:1001742;NT=LTQ Orbitrap Velos" ...
## $ comment.precursor.mass.tolerance. : chr "5 ppm" "5 ppm" "5 ppm" "5 ppm" ...
## $ comment.fragment.mass.tolerance. : chr "0.8 Da" "0.8 Da" "0.8 Da" "0.8 Da" ...
## $ comment.cleavage.agent.details. : chr "NT=trypsin/P;AC=MS:1001313" "NT=trypsin/P;AC=MS:1001313" "NT=trypsin/P;AC=MS:1001313" "NT=trypsin/P;AC=MS:1001313" ...
## $ comment.modification.parameters. : chr "NT=Carbamidomethyl;TA=C;MT=fixed;AC=UNIMOD:4" "NT=Carbamidomethyl;TA=C;MT=fixed;AC=UNIMOD:4" "NT=Carbamidomethyl;TA=C;MT=fixed;AC=UNIMOD:4" "NT=Carbamidomethyl;TA=C;MT=fixed;AC=UNIMOD:4" ...
## $ comment.modification.parameters..1 : chr "NT=Oxidation;MT=variable;TA=M;AC=UNIMOD:35" "NT=Oxidation;MT=variable;TA=M;AC=UNIMOD:35" "NT=Oxidation;MT=variable;TA=M;AC=UNIMOD:35" "NT=Oxidation;MT=variable;TA=M;AC=UNIMOD:35" ...
## $ comment.modification.parameters..2 : chr "NT=Acetyl;AC=UNIMOD:67;PP=Protein N-term;MT=variable" "NT=Acetyl;AC=UNIMOD:67;PP=Protein N-term;MT=variable" "NT=Acetyl;AC=UNIMOD:67;PP=Protein N-term;MT=variable" "NT=Acetyl;AC=UNIMOD:67;PP=Protein N-term;MT=variable" ...
## $ comment.technical.replicate. : int 1 2 3 1 2 3 1 2 3 1 ...
## $ comment.fraction.identifier. : int 1 1 1 1 1 1 1 1 1 1 ...
## $ comment.file.uri. : chr "https://ftp.ebi.ac.uk/pride-archive/2015/12/PXD001819/UPS1_12500amol_R1.raw" "https://ftp.ebi.ac.uk/pride-archive/2015/12/PXD001819/UPS1_12500amol_R2.raw" "https://ftp.ebi.ac.uk/pride-archive/2015/12/PXD001819/UPS1_12500amol_R3.raw" "https://ftp.ebi.ac.uk/pride-archive/2015/12/PXD001819/UPS1_125amol_R1.raw" ...
## $ comment.data.file. : chr "UPS1_12500amol_R1.raw" "UPS1_12500amol_R2.raw" "UPS1_12500amol_R3.raw" "UPS1_125amol_R1.raw" ...
## $ factor.value.spiked.compound. : chr "CT=mixture;QY=12500 amol;CN=UPS1;CV=Standards Research Group" "CT=mixture;QY=12500 amol;CN=UPS1;CV=Standards Research Group" "CT=mixture;QY=12500 amol;CN=UPS1;CV=Standards Research Group" "CT=mixture;QY=125 amol;CN=UPS1;CV=Standards Research Group" ...
The main reason of normalization is to remove variability in the data which is not directly linked to the (original/biological) concept of a given experiment. High throughput data from real world measurements may easily contain various deformations due to technical reasons, eg slight temperature variations, electromagnetic interference, instability of reagents etc. In particular, transferring constant amounts of liquids/reagents in highly repeated steps over large experiments is often also very challenging, small variations of the amounts of liquid (or similar) are typically addressed by normalization. However, applying aggressive normalization to the data also brings considerable risk of starting to loose some of the effects one intended to study. At some point it may rather be better to eliminate a few samples or branches of an experiment to avoid too invasive intervention. This shows that quality control can be tightly linked to decisions about data-normalization. In conclusion, normalization may be far more challenging than simply running some algorithms..
In general, the use has to assume/define some hypothesis to justify intervention. Sometimes specific elements of an experiment are known to be not affected and can therefore be used to normalize the rest. Eg, if you observe growth of trees in a forest, big blocks of rock on the floor are assumed no to change their location. So one could use them as alignment-marks to superpose pictures taken at slightly different positions.
The hypothesis of no global changes is very common : During the course of many biological experiments (eg change of nutrient) one assumes that only a small portion of the elements measured (eg the abundance of all different gene-products) do change, since many processes of a living cell like growth, replication and interaction with neighbour-cells are assumed not to be affected. So, if one assumes that there are no global changes one normalizes the input-data in a way that the average or median across each experiment will give the same value. In analogy, if one takes photographs on a partially cloudy day, most cameras will adjust light settings (sun r clouds) so that global luminosity stays the same. However, if too many of the measured elements are affected, this normalization approach will lead to (additional) loss of information.
It is essential to understand the type of deformation(s) data may suffer from in order to choose the appropriate approacges for normalization. Of course, graphical representations (PCA, MA-plots, etc) are extremely important to identifying abnormalities and potential problems. The package wrGraph offers also complementary options useful in the context of normalization. Again, graphical representation(s) of the data help to visualize how different normalization procedures affect outcomes.
Before jumping into normalization it may be quite useful to filter the data first. The overall idea is, that most high-throughput experiments do produce some non-meaningful data (artefacts) and it may be wise to remove such ‘bad’ data first, as they may effect normalization (in particular extreme values). A special case of problematic data concerns NA-values.
Frequent NA-values may represent another potential issue. With NA-values there is no general optimal advice. To get started, you should try to investigate how and why NA-values occurred to check if there is a special ‘meaning’ to them. For example, on some measurement systems values below detection limit may be simply reported as NAs. If the lines of your data represent different features quantified (eg proteins), than lines with mostly NA-values represent features that may not be well exploited anyway. Therefore many times one tries to filter away lines of ‘bad’ data. Of course, if there is a column (sample) with an extremely high content of NAs, one should also investigate what might be particular with this column (sample), to see if one might be better of to eliminate the entire column.
Please note, that imputing NA-values represents another option instead of filtering and removing, multiple other packages address this in detail, too. All decisions of which approach to use should be data-driven.
Filter for each group of columns for sufficient data as non-NA The
function presenceGrpFilt()
allows to
dat1 <- matrix(1:56,ncol=7)
dat1[c(2,3,4,5,6,10,12,18,19,20,22,23,26,27,28,30,31,34,38,39,50,54)] <- NA
grp1 <- gl(3,3)[-(3:4)]
dat1
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 1 9 17 25 33 41 49
## [2,] NA NA NA NA NA 42 NA
## [3,] NA 11 NA NA 35 43 51
## [4,] NA NA NA NA 36 44 52
## [5,] NA 13 21 29 37 45 53
## [6,] NA 14 NA NA NA 46 NA
## [7,] 7 15 NA NA NA 47 55
## [8,] 8 16 24 32 40 48 56
## now let's filter
presenceGrpFilt(dat1, gr=grp1, presThr=0.75) # stringent
## 1 2 3
## [1,] TRUE TRUE TRUE
## [2,] FALSE FALSE FALSE
## [3,] FALSE FALSE TRUE
## [4,] FALSE FALSE TRUE
## [5,] FALSE TRUE TRUE
## [6,] FALSE FALSE FALSE
## [7,] TRUE FALSE FALSE
## [8,] TRUE TRUE TRUE
presenceGrpFilt(dat1, gr=grp1, presThr=0.25) # less stringent
## 1 2 3
## [1,] TRUE TRUE TRUE
## [2,] FALSE FALSE TRUE
## [3,] TRUE FALSE TRUE
## [4,] FALSE FALSE TRUE
## [5,] TRUE TRUE TRUE
## [6,] TRUE FALSE TRUE
## [7,] TRUE FALSE TRUE
## [8,] TRUE TRUE TRUE
If you want to use your data in a pair-wise view (like running
t-tests on each line) the function presenceFilt()
allows to
eliminate lines containing too many NA-values for each
pair-wise combination of the groups/levles.
presenceFilt(dat1, gr=grp1, maxGr=1, ratM=0.1)
## presenceFilt : correcting 'maxGrpMiss' for group(s) 1, 2 and 3 due to ratMaxNA=0.1
## 1-2 1-3 2-3
## [1,] TRUE TRUE TRUE
## [2,] FALSE FALSE FALSE
## [3,] FALSE TRUE TRUE
## [4,] FALSE TRUE TRUE
## [5,] TRUE TRUE TRUE
## [6,] FALSE FALSE FALSE
## [7,] TRUE TRUE FALSE
## [8,] TRUE TRUE TRUE
presenceFilt(dat1, gr=grp1, maxGr=2, rat=0.5)
## presenceFilt : correcting 'maxGrpMiss' for group(s) 1 and 2 due to ratMaxNA=0.5
## 1-2 1-3 2-3
## [1,] TRUE TRUE TRUE
## [2,] FALSE TRUE TRUE
## [3,] TRUE TRUE TRUE
## [4,] FALSE TRUE TRUE
## [5,] TRUE TRUE TRUE
## [6,] TRUE TRUE TRUE
## [7,] TRUE TRUE TRUE
## [8,] TRUE TRUE TRUE
This procedures aims to remove (by setting to as NA) the most extreme of noisy replicates. Thus, it is assumed that all columns of the input matrix (or data.frame) are replicates of the other columns. The nOutl most distant points are identified and will be set to NA.
(mat3 <- matrix(c(19,20,30,40, 18,19,28,39, 16,14,35,41, 17,20,30,40), ncol=4))
## [,1] [,2] [,3] [,4]
## [1,] 19 18 16 17
## [2,] 20 19 14 20
## [3,] 30 28 35 30
## [4,] 40 39 41 40
cleanReplicates(mat3, nOutl=1)
## cleanReplicates : rownames of 'x' either NULL or not unique, replacing by row-numbers
## cleanReplicates : removing 1 entries in lines 2
## [,1] [,2] [,3] [,4]
## 1 19 18 16 17
## 2 20 19 NA 20
## 3 30 28 35 30
## 4 40 39 41 40
cleanReplicates(mat3, nOutl=3)
## cleanReplicates : rownames of 'x' either NULL or not unique, replacing by row-numbers
## cleanReplicates : removing 3 entries in lines 1,2,3
## [,1] [,2] [,3] [,4]
## 1 NA 18 16 17
## 2 20 19 NA 20
## 3 30 28 NA 30
## 4 40 39 41 40
In biological high-throughput data columns typically represent different samples, which may be organized as replicates. During high-throughput experiments thousands of (independent) elements are measured (eg abundance of gene-products), they are represented by rows. As real-world experiments are not always as perfect as we may think, small changes in the signal measured may easily happen. Thus, the aim of normalizing is to remove or reduce any trace/variability in the data not related to the original experiement but due to imperfections during detection.
Note, that some experiments may produce a considerable amount of missing data (NAs) which require special attention (dedicated developments exist in other R-packages eg in wrProteo). My general advice is to first carefully look where such missing data is observed and to pay attention to replicate measurements where a given element once was measured with a real numeric value and once as missing information (NA).
set.seed(2015); rand1 <- round(runif(300) +rnorm(300,0,2),3)
dat1 <- cbind(ser1=round(100:1 +rand1[1:100]), ser2=round(1.2*(100:1 +rand1[101:200]) -2),
ser3=round((100:1 +rand1[201:300])^1.2-3))
dat1 <- cbind(dat1, ser4=round(dat1[,1]^seq(2,5,length.out=100) +rand1[11:110],1))
## Let's introduce some NAs
dat1[dat1 <1] <- NA
## Let's get a quick overview of the data
summary(dat1)
## ser1 ser2 ser3 ser4
## Min. : 2.00 Min. : 1.00 Min. : 1.0 Min. : 37.5
## 1st Qu.: 26.75 1st Qu.: 28.00 1st Qu.: 50.0 1st Qu.: 67210.0
## Median : 49.50 Median : 59.00 Median :109.0 Median : 332524.0
## Mean : 51.14 Mean : 58.79 Mean :115.1 Mean : 542279.4
## 3rd Qu.: 76.25 3rd Qu.: 89.50 3rd Qu.:173.5 3rd Qu.: 925759.5
## Max. :100.00 Max. :121.00 Max. :263.0 Max. :2123191.7
## NA's :1
## some selected lines (indeed, the 4th column appears always much higher)
dat1[c(1:5,50:54,95:100),]
## ser1 ser2 ser3 ser4
## [1,] 100 121 251 10000.6
## [2,] 100 117 244 11500.2
## [3,] 99 120 263 12948.1
## [4,] 99 120 242 14885.1
## [5,] 97 114 236 16382.3
## [6,] 51 60 111 892534.1
## [7,] 48 58 109 812490.4
## [8,] 49 55 108 982907.4
## [9,] 50 56 107 1188787.7
## [10,] 45 47 102 915343.6
## [11,] 3 6 5 206.4
## [12,] 5 2 7 2570.0
## [13,] 8 1 3 27125.8
## [14,] 6 4 5 6975.9
## [15,] 3 3 1 237.0
## [16,] 2 1 NA 37.5
Our toy data may be normalized by a number of different criteria. In real applications the nature of the data and the type of deformation detected/expected will largely help deciding which normalization might be the ‘best’ choice. Here we’ll try first normalizing by the mean, ie all columns will be forced to end up with the same column-mean. The trimmed mean does not consider values at extremes (as outliers are frequently artefacts and display extreme values). When restricting even stronger which values to consider one will eventually end up with the median (3rd method used below).
no1 <- normalizeThis(dat1, refGrp=1:3, meth="mean")
no2 <- normalizeThis(dat1, refGrp=1:3, meth="trimMean", trim=0.4)
no3 <- normalizeThis(dat1, refGrp=1:3, meth="median")
no4 <- normalizeThis(dat1, refGrp=1:3, meth="slope", quantFa=c(0.2,0.8))
It is suggested to verify normalization results by plots. Note, that Box plots may not be appropriate in some cases (eg multimodal distributions), for displaying more details you may consider using Violin-Plots from packages vioplot or wrGraph, another option might be a (cumulated) frequency plot (eg in package wrGraph).
You can see clearly, that the 4th data-set has a problem of range. So we’ll see if some proportional normalization may help to make it more comparable to the other ones.
The standard approach for normalizing relies on consisting all
columns as collections of data who’s distribution is not supposed to
change. In some cases/projects we may want to formulate a much more
‘aggressive’ hypothesis : We consider the content of all columns
strictly as the same. For example this may be the case when comparing
with technical replicates only. In such cases one may use the function
rowNormalize()
which tries to find the average or mean
optimal within-line normalization factor.
Besides, an additional mode of operation for sparse data has been added : Basically, once a row contains just one NA, this row can’t be used any more to derive a normalization factor for all rows. Thus, with many NA-values the number of ‘complete’ rows will be low or even 0 redering this approach inefficient or impossible. Once the content of NA-values is above a customizable threshold, the data will be broken in smaller subsets with fewer groups of fewer columns, thus increasing the chances of finding ‘complete’ subsets of data which will be normalized first and added to other subsets in later steps.
This approach relies on the hypothesis that all data in a given line should be (aproximately) the same value ! Thus, this procedure is particularly well adopted to the case when all samples are multiple replicate measurements of the same sample.
set.seed(2); AA <- matrix(rbinom(110, 10, 0.05), nrow=10)
AA[,4:5] <- AA[,4:5] *rep(4:3, each=nrow(AA))
AA1 <- rowNormalize(AA)
round(AA1, 2)
## 1 2 3 4 5 6 7 8 9 10 11
## [1,] 0.65 0.87 2.13 0.34 3.21 0.48 2.04 0.97 0.74 4.34 0.82
## [2,] 1.72 0.87 0.80 0.34 0.29 0.48 2.04 0.97 1.98 1.00 0.82
## [3,] 0.65 2.33 2.13 2.59 0.29 1.27 2.04 0.97 0.74 1.00 3.57
## [4,] 0.65 0.87 0.80 2.59 0.29 2.06 0.77 0.97 0.74 1.00 2.20
## [5,] 2.80 0.87 0.80 0.34 3.21 0.48 2.04 0.97 0.74 1.00 0.82
## [6,] 2.80 2.33 0.80 2.59 1.75 1.27 0.77 2.59 1.98 1.00 0.82
## [7,] 0.65 3.79 0.80 2.59 3.21 1.27 0.77 0.97 3.21 1.00 2.20
## [8,] 1.72 0.87 0.80 0.34 0.29 2.85 0.77 2.59 0.74 1.00 0.82
## [9,] 0.65 0.87 3.47 2.59 0.29 1.27 0.77 0.97 1.98 1.00 0.82
## [10,] 0.65 0.87 0.80 0.34 1.75 1.27 0.77 0.97 0.74 1.00 0.82
Now, let’s make this sparse and try normalizing:
AC <- AA
AC[which(AC <1)] <- NA
(AC1 <- rowNormalize(AC))
## 1 2 3 4 5 6 7 8 9
## [1,] NA NA 2.343543 NA 3.870829 NA NaN NA NA
## [2,] 1.388298 NA NA NA NA NA NaN NA 3.597222
## [3,] NA 2 2.343543 3.499743 NA 2.663721 NaN NA NA
## [4,] NA NA NA 3.499743 NA 5.327441 NA NA NA
## [5,] 2.776596 NA NA NA 3.870829 NA NaN NA NA
## [6,] 2.776596 2 NA 3.499743 1.935414 2.663721 NA 2.25 3.597222
## [7,] NA 4 NA 3.499743 3.870829 2.663721 NA NA 7.194444
## [8,] 1.388298 NA NA NA NA 7.991162 NA 2.25 NA
## [9,] NA NA 4.687086 3.499743 NA 2.663721 NA NA 3.597222
## [10,] NA NA NA NA 1.935414 2.663721 NA NA NA
## 10 11
## [1,] 2.583333 NA
## [2,] NA NA
## [3,] NA 3.407407
## [4,] NA 1.703704
## [5,] NA NA
## [6,] NA NA
## [7,] NA 1.703704
## [8,] NA NA
## [9,] NA NA
## [10,] NA NA
Like with normalizeThis() we can define some reference-lines (only these lines will be considered to determine normalization-factors)
(AC3 <- rowNormalize(AC, refLines=1:5, omitNonAlignable=TRUE))
## 1 2 3 4 5 6 7 8 9 10 11
## [1,] NA NA 1.909091 NA 1.750 NA NaN NA NA 1.75 NA
## [2,] NaN NA NA NA NA NA NaN NA NaN NA NA
## [3,] NA 2 1.909091 2.409836 NA 2.684932 NaN NA NA NA 4.083333
## [4,] NA NA NA 2.409836 NA 5.369863 NA NA NA NA 2.041667
## [5,] NaN NA NA NA 1.750 NA NaN NA NA NA NA
## [6,] NaN 2 NA 2.409836 0.875 2.684932 NA NaN NaN NA NA
## [7,] NA 4 NA 2.409836 1.750 2.684932 NA NA NaN NA 2.041667
## [8,] NaN NA NA NA NA 8.054795 NA NaN NA NA NA
## [9,] NA NA 3.818182 2.409836 NA 2.684932 NA NA NaN NA NA
## [10,] NA NA NA NA 0.875 2.684932 NA NA NA NA NA
Please note, that the iterative procedure for sparse data may consume large amounts of computational resources, in particular when a small number of subgroups has been selected.
Sometimes one needs to obtain the coordinates of values/points of a
matrix according to a given filtering condition. The standard approach
using which() gives only a linearized index but not
row/column, which is sufficient for replacing indexed values. If you
need to know the true row/column indexes, you may use
coordOfFilt()
.
set.seed(2021); ma1 <- matrix(sample.int(n=40, size=27, replace=TRUE), ncol=9)
## let's test which values are >37
which(ma1 >37) # doesn't tell which row & col
## [1] 2 3 6 7 9 14 26
coordOfFilt(ma1, ma1 >37)
## row col
## [1,] 2 1
## [2,] 3 1
## [3,] 3 2
## [4,] 1 3
## [5,] 3 3
## [6,] 2 5
## [7,] 2 9
Under certain circumstances using the trimmed mean may be more stable
towards outlyer values. The base function mean()
allows to
preform symmetric trimming, thus with more trimming the result will
start converging to the mean. The function
trimmedMean()
gives more flexible options for assinging
different upper- and lower fractions of the initial data to be trimmed.
When outlyer values always appear at the high (or low) end of data, such
proceeding may be useful. However, the user is encouraged to use this
with caution since there is also a risk of introducing bias.
x <- c(17:11,27:28)
mean(x)
## [1] 17
mean(x, trim=0.15) # symmetric trimming
## [1] 16.28571
mean(x[x < 25]) # manual trimming
## [1] 14
trimmedMean(x, trim=c(l=0, u=0.7)) # asymmetric trim
## [1] 13.5
When creating random values to an expected mean and
sd, the results ontained using the standard function
rnorm()
may deviate somehow from the expected mean and sd,
in particular with low n. To still produce random values
fitting closely to the expected mean and sd you may
use the function rnormW()
. The case of n=2 is
quite simple with one possible results. In other cases
(n>2), there will be a random initiation which can be fixed
using the argument seed.
## some sample data :
x1 <- (11:16)[-5]
mean(x1); sd(x1)
## [1] 13.2
## [1] 1.923538
## the standard way for gerenating normal random values
ra1 <- rnorm(n=length(x1), mean=mean(x1), sd=sd(x1))
## In particular with low n, the random values deviate somehow from expected mean and sd :
mean(ra1) -mean(x1)
## [1] -1.103347
sd(ra1) -sd(x1)
## [1] 0.3920622
## random numbers with close fit to expected mean and sd :
ra2 <- rnormW(length(x1), mean(x1), sd(x1))
mean(ra2) -mean(x1)
## [1] 0
sd(ra2) -sd(x1) # much closer to expected value
## [1] -4.440892e-16
Thus, the second data-sets fits even with few n very well to the global characteristics defined/expected.
If you are not familiar with the way data is handled in the
Bioconductor package limma
and you would like to use some of the tools for running moderated
t-tests therein, this will provide easy access using
moderTest2grp()
:
set.seed(2017); t8 <- matrix(round(rnorm(1600,10,0.4),2), ncol=8,
dimnames=list(paste("l",1:200), c("AA1","BB1","CC1","DD1","AA2","BB2","CC2","DD2")))
t8[3:6,1:2] <- t8[3:6,1:2]+3 # augment lines 3:6 for AA1&BB1
t8[5:8,5:6] <- t8[5:8,5:6]+3 # augment lines 5:8 for AA2&BB2 (c,d,g,h should be found)
t4 <- log2(t8[,1:4]/t8[,5:8])
fit4 <- moderTest2grp(t4, gl(2,2))
## now we'll use limma's topTable() function to look at the 'best' results
if("list" %in% mode(fit4)) { # if you have limma installed we can look further
library(limma)
topTable(fit4, coef=1,n=5) # effect for 3,4,7,8
fit4in <- moderTest2grp(t4, gl(2,2), testO="<")
if("list" %in% mode(fit4in)) topTable(fit4in, coef=1,n=5) }
## moderTest2grp : Testing alternative hypothesis: true difference in means is less than 0 (ie focus on 101 results with A less than B)
## logFC AveExpr t P.Value adj.P.Val B
## l 7 -0.4975806 -0.2436786 -8.712092 3.994695e-17 7.989390e-15 30.668381
## l 4 0.4020373 0.1890232 7.039234 1.000000e+00 1.000000e+00 17.723883
## l 8 -0.3735170 -0.2259811 -6.539873 9.417239e-11 9.417239e-09 14.392733
## l 3 0.3508834 0.1488240 6.143585 1.000000e+00 1.000000e+00 11.923522
## l 27 -0.1348878 -0.1011609 -2.361738 9.333949e-03 6.222633e-01 -3.878176
If you want to make multiple pair-wise comparisons using
moderTestXgrp()
:
grp <- factor(rep(LETTERS[c(3,1,4)], c(2,3,3)))
set.seed(2017); t8 <- matrix(round(rnorm(208*8,10,0.4),2), ncol=8,
dimnames=list(paste(letters[], rep(1:8,each=26),sep=""), paste(grp,c(1:2,1:3,1:3),sep="")))
t8[3:6,1:2] <- t8[3:6,1:2] +3 # augment lines 3:6 (c-f)
t8[5:8,c(1:2,6:8)] <- t8[5:8,c(1:2,6:8)] -1.5 # lower lines
t8[6:7,3:5] <- t8[6:7,3:5] +2.2 # augment lines
## expect to find C/A in c,d,g, (h)
## expect to find C/D in c,d,e,f
## expect to find A/D in f,g,(h)
test8 <- moderTestXgrp(t8, grp)
head(test8$p.value, n=8)
## A-C A-D C-D
## a1 8.736828e-02 6.776543e-02 9.397304e-01
## b1 4.384118e-01 5.400019e-01 8.205610e-01
## c1 1.094834e-19 6.344497e-01 2.571471e-21
## d1 2.671725e-13 9.915692e-01 2.858699e-13
## e1 1.802454e-03 2.413137e-08 9.735465e-16
## f1 3.188362e-01 2.527208e-32 2.226490e-22
## g1 1.166242e-29 6.410057e-33 5.484445e-01
## h1 1.141181e-05 1.943795e-05 5.674938e-01
To get an introduction into local false discovery rate estimations
you may read Strimmer 2008.
A convenient way to get lfdr values calculated by the package fdrtool is
available via the function pVal2lfdr()
.
Note, that the toy-example used below is too small for estimating meaningful lfdr values. For this reason the function fdrtool() from package fdrtool will issue warnings.
set.seed(2017); t8 <- matrix(round(rnorm(160,10,0.4),2), ncol=8, dimnames=list(letters[1:20],
c("AA1","BB1","CC1","DD1","AA2","BB2","CC2","DD2")))
t8[3:6,1:2] <- t8[3:6,1:2] +3 # augment lines 3:6 (c-f) for AA1&BB1
t8[5:8,5:6] <- t8[5:8,5:6] +3 # augment lines 5:8 (e-h) for AA2&BB2 (c,d,g,h should be found)
head(pVal2lfdr(apply(t8, 1, function(x) t.test(x[1:4], x[5:8])$p.value)))
## Warning in fdrtool::fdrtool(z, statistic = "pvalue", plot = FALSE, verbose =
## !silent): There may be too few input test statistics for reliable FDR
## calculations!
## a b c d e f
## 1.0000000 0.5753562 0.5753562 1.0000000 1.0000000 1.0000000
The confindence
interval (CI) is a common way of describing the uncertainity of
measured or estimated values. The function confInt()
allows
calculating the confidence interval of the mean (using the functions
qt() and sd()) under a given significance
level (alpha). assuming that the Normal
distribution is valid.
set.seed(2022); ran <- rnorm(50)
confInt(ran, alpha=0.05)
## [1] 0.248199
## plot points and confindence interval of mean
plot(ran, jitter(rep(1, length(ran))), ylim=c(0.95, 1.05), xlab="random variable 'ran'",main="Points and Confidence Interval of Mean (alpha=0.05)", ylab="", las=1)
points(mean(ran), 0.97, pch=3, col=4) # mean
lines(mean(ran) +c(-1, 1) *confInt(ran, 0.05), c(0.97, 0.97), lwd=4, col=4) # CI
legend("topleft","95% conficence interval of mean", text.col=4,col=4,lty=1,lwd=1,seg.len=1.2,cex=0.9,xjust=0,yjust=0.5)
When running multiple pairwise tests (using moderTestXgrp())
the column-names are concatenated group-names. To get the index of which
group has been used in which pair-wise set you may use the function
matchSampToPairw()
, as shown below.
## make example if limma is not installed
if(!requireNamespace("limma", quietly=TRUE)) test8 <- list(FDR=matrix(1, nrow=2, ncol=3, dimnames=list(NULL,c("A-C","A-D","C-D"))))
matchSampToPairw(unique(grp), colnames(test8$FDR))
## le ri
## A-C 2 1
## A-D 2 3
## C-D 1 3
When running multiple pairwise tests (using moderTestXgrp())
the results will be in adjacent columns and the group-names reflected in
the column-names. In the case measurements from multiple levels of a
given variable are compared it is useful to extract the numeric part,
the function numPairDeColNames()
provides support to do so.
When extracting just the numeric part, unit names will get lost, though.
Note, if units used are not constant (eg seconds and milliseconds mixed)
the extracted numeric values do not reflect the real quantitative
context any more.
mat1 <- matrix(1:8, nrow=2, dimnames=list(NULL, paste0(1:4,"-",6:9)))
numPairDeColNames(mat1)
## numPartDeColNames : PROBLEM ? : 'stripTxt' does REMOVE the separator 'sep' ! Select a different separator or 'stripTxt' strategy to resolve pairwise combinations !
## index log2rat conc1 conc2
## [1,] 1 2.585 1 6
## [2,] 2 1.807 2 7
## [3,] 3 1.415 3 8
## [4,] 4 1.170 4 9
In order to run statistical testing the user must know which sample
should be considered replicate of whom. The function ()
aims to provide help by checking all column of a matrix of meta-data
with the aim of identifying the replicate-status.
To do so, all columns are examined how many groups of replicats they may design. Depending on the argumen method various options for choosing automatically exist : The default method=“combAll” will select the column with the median number of groups (not counting all-different or all-same columns)). When using as method=“combAll” (ie combine all columns that are neither all-different nor all-same), there is risk all lines (samples) will be be considered different and no replicates remain. To avoid this situation the argument -method_ can be set to “combNonOrth”. Then, it will be checked if adding more columns will lead to complete loss of replicates, and -if so- concerned columns omitted.
## column a is all different, b is groups of 2,
## c & d are groups of 2 nut NOT 'same general' pattern as b
strX <- data.frame(a=letters[18:11], b=letters[rep(c(3:1,4), each=2)],
c=letters[rep(c(5,8:6), each=2)], d=letters[c(1:2,1:3,3:4,4)],
e=letters[rep(c(4,8,4,7),each=2)], f=rep("z",8) )
strX
## a b c d e f
## 1 r c e a d z
## 2 q c e b d z
## 3 p b h a h z
## 4 o b h b h z
## 5 n a g c d z
## 6 m a g c d z
## 7 l d f d g z
## 8 k d f d g z
replicateStructure(strX[,1:2])
## $col
## b
## 2
##
## $lev
## c c b b a a d d
## 1 1 2 2 3 3 4 4
##
## $meth
## [1] "single informative col"
replicateStructure(strX[,1:4], method="combAll")
## $col
## b d
## 2 4
##
## $lev
## c_a c_b b_a b_b a_c a_c d_d d_d
## 1 2 3 4 5 5 6 6
##
## $meth
## [1] "comb all col"
replicateStructure(strX[,1:4], method="combAll", exclNoRepl=FALSE)
## $col
## a
## 1
##
## $lev
## 1 2 3 4 5 6 7 8
## 1 2 3 4 5 6 7 8
##
## $meth
## [1] "(shortcut, first) single col at max divergence"
##
## $allCols
## a b c d
## [1,] 1 1 1 1
## [2,] 2 1 1 2
## [3,] 3 2 2 1
## [4,] 4 2 2 2
## [5,] 5 3 3 3
## [6,] 6 3 3 3
## [7,] 7 4 4 4
## [8,] 8 4 4 4
replicateStructure(strX[,1:4], method="combNonOrth", exclNoRepl=TRUE)
## $col
## b d
## 2 4
##
## $lev
## c_a c_b b_a b_b a_c a_c d_d d_d
## 1 2 3 4 5 5 6 6
##
## $meth
## [1] "combNonOrth col"
replicateStructure(strX, method="lowest")
## $col
## e
## 5
##
## $lev
## d d h h d d g g
## 1 1 2 2 1 1 3 3
##
## $meth
## [1] "single min col"
Multiple concepts for clustering have been deeveloped, most of them allow extracting a vector with the cluster-numbers. Here some functions helping to work with the output of such clustering results are presented.
The way how to prepare data for clustering may be as important as the choice of the actual clustering-algorithm …
Many clustering algorithms are available in R (eg see also CRAN Task View: Cluster
Analysis & Finite Mixture Models), many of them require the
input data to be standardized. The regular way of standardizing sets all
elements to mean=0 and sd=1. To do so, the function scale()
may be used.
dat <- matrix(2*round(runif(100),2), ncol=4)
mean(dat); sd(dat)
## [1] 1.0348
## [1] 0.5991349
datS <- scale(dat)
apply(datS, 2, sd)
## [1] 1 1 1 1
# each column was teated separately
mean(datS); sd(datS); range(datS)
## [1] 1.274615e-17
## [1] 0.9847319
## [1] -1.898224 1.708967
# the mean is almost 0.0 and the sd almost 1.0
datB <- scale(dat, center=TRUE, scale=FALSE)
mean(datB); sd(datB); range(datB) # mean is almost 0
## [1] 4.435522e-18
## [1] 0.5815165
## [1] -1.2096 0.8984
However, if you want the entire data-set and not each column
sparately, you may use standardW()
. Thus, relative
differences visible within a line will be conserved. Furthermore, in
case of 3-dim arrays, this function returns also the same dimensions as
the input.
datS2 <- standardW(dat)
apply(datS2, 2, sd)
## [1] 1.1773030 0.9158595 0.9519728 0.8688335
summary(datS2)
## V1 V2 V3 V4
## Min. :-1.6938 Min. :-1.6270 Min. :-1.6604 Min. :-1.5602
## 1st Qu.:-1.0929 1st Qu.:-0.5922 1st Qu.:-0.9594 1st Qu.:-1.0595
## Median : 0.9767 Median : 0.2757 Median :-0.3585 Median :-0.4587
## Mean : 0.3251 Mean : 0.1115 Mean :-0.1289 Mean :-0.3078
## 3rd Qu.: 1.3106 3rd Qu.: 0.8098 3rd Qu.: 0.7431 3rd Qu.: 0.3425
## Max. : 1.5442 Max. : 1.6110 Max. : 1.2438 Max. : 1.1770
mean(datS2); sd(datS2)
## [1] 1.046597e-16
## [1] 1
datS3 <- standardW(dat, byColumn=TRUE)
apply(datS3, 2, sd)
## [1] 0.849399 1.091871 1.050450 1.150969
summary(datS3)
## V1 V2 V3 V4
## Min. :-1.7149 Min. :-1.97112 Min. :-1.6439 Min. :-1.5952
## 1st Qu.:-1.0060 1st Qu.:-1.05991 1st Qu.:-0.7672 1st Qu.:-0.6347
## Median :-0.1696 Median : 0.01531 Median : 0.2672 Median : 0.4987
## Mean :-0.2762 Mean :-0.12174 Mean : 0.1354 Mean : 0.3542
## 3rd Qu.: 0.4613 3rd Qu.: 0.82628 3rd Qu.: 1.0474 3rd Qu.: 1.3536
## Max. : 1.0922 Max. : 1.63725 Max. : 1.8276 Max. : 2.2084
mean(datS3); sd(datS3)
## [1] 0.022922
## [1] 1.065665
Sometimes it is sufficient to only set the minimum and maximum to a given range.
datR2 <- apply(dat, 2, scaleXY, 1, 100)
summary(datR2); sd(datR2)
## V1 V2 V3 V4
## Min. : 1.00 Min. : 1.00 Min. : 1.00 Min. : 1.00
## 1st Qu.: 19.37 1st Qu.: 32.64 1st Qu.: 24.90 1st Qu.: 19.11
## Median : 82.65 Median : 59.18 Median : 45.38 Median : 40.84
## Mean : 62.73 Mean : 54.15 Mean : 53.21 Mean : 46.30
## 3rd Qu.: 92.86 3rd Qu.: 75.51 3rd Qu.: 82.93 3rd Qu.: 69.82
## Max. :100.00 Max. :100.00 Max. :100.00 Max. :100.00
## [1] 32.14382
Here a very basic clustering example…
nGr <- 3
irKm <- stats::kmeans(iris[,1:4], nGr, nstart=nGr*4) # no need to standardize
table(irKm$cluster, iris$Species)
##
## setosa versicolor virginica
## 1 0 48 14
## 2 0 2 36
## 3 50 0 0
#wrGraph::plotPCAw(t(as.matrix(iris[,1:4])), sampleGrp=irKm,colBase=irKm$cluster,useSymb=as.numeric(as.factor(iris$Species)))
Using the function reorgByCluNo()
we can now ‘apply’ the
clustering result to the initial data to obtain other information.
## sort results by cluster number
head(reorgByCluNo(iris[,-5], irKm$cluster))
## Sepal.Length Sepal.Width Petal.Length Petal.Width index geoMean cluNo
## 118 7.7 3.8 6.7 2.2 118 4.557146 1
## 110 7.2 3.6 6.1 2.5 110 4.458884 1
## 132 7.9 3.8 6.4 2.0 132 4.427465 1
## 136 7.7 3.0 6.1 2.3 136 4.242945 1
## 119 7.7 2.6 6.9 2.3 119 4.221922 1
## 106 7.6 3.0 6.6 2.1 106 4.216232 1
tail(reorgByCluNo(iris[,-5], irKm$cluster))
## Sepal.Length Sepal.Width Petal.Length Petal.Width index geoMean cluNo
## 23 4.6 3.6 1.0 0.2 23 1.349033 3
## 33 5.2 4.1 1.5 0.1 33 1.337272 3
## 38 4.9 3.6 1.4 0.1 38 1.253593 3
## 10 4.9 3.1 1.5 0.1 10 1.228605 3
## 13 4.8 3.0 1.4 0.1 13 1.191578 3
## 14 4.3 3.0 1.1 0.1 14 1.091429 3
Let’s calculate the median and sd values for each cluster:
## median an CV
ir2 <- reorgByCluNo(iris[,-5], irKm$cluster, addInfo=FALSE, retList=TRUE)
sapply(ir2, function(x) apply(x, 2, median))
## 1 2 3
## Sepal.Width 2.8 3.00 3.4
## Petal.Length 4.5 5.65 1.5
## Petal.Width 1.4 2.10 0.2
sapply(ir2, colSds)
## Warning in colSums(matrix(as.numeric(!is.na(t(dat))), ncol = nrow(dat)) * :
## longer object length is not a multiple of shorter object length
## Warning in colSums(matrix(as.numeric(!is.na(t(dat))), ncol = nrow(dat)) * :
## longer object length is not a multiple of shorter object length
## Warning in colSums(matrix(as.numeric(!is.na(t(dat))), ncol = nrow(dat)) * :
## longer object length is not a multiple of shorter object length
## $`1`
## 51 52 54 55 56 57 58
## 0.07001335 0.06000809 0.07828623 0.02853218 0.02287431 0.08354720 0.15709795
## 59 60 61 62 63 64 65
## 0.03700469 0.06364329 0.15923773 0.04151695 0.10273735 0.04399074 0.10485173
## 66 67 68 69 70 71 72
## 0.04523536 0.03599045 0.06735713 0.07202384 0.08265631 0.09082568 0.05363288
## 73 74 75 76 77 79 80
## 0.07271801 0.04979830 0.02853218 0.03251681 0.05263765 0.02518504 0.12859209
## 81 82 83 84 85 86 87
## 0.09794136 0.11384682 0.07020193 0.09312551 0.03599045 0.08717141 0.06031574
## 88 89 90 91 92 93 94
## 0.05991990 0.05238589 0.06200184 0.03547245 0.04189733 0.06161683 0.16120159
## 95 96 97 98 99 100 102
## 0.03076184 0.05048383 0.03584321 0.02853218 0.18621084 0.04183417 0.10854413
## 107 114 115 120 122 124 127
## 0.04856173 0.11088184 0.15338493 0.10502810 0.09748133 0.08025421 0.07035242
## 128 134 139 143 147 150
## 0.08625665 0.09108731 0.07709505 0.10854413 0.10296872 0.10685009
##
## $`2`
## 53 78 101 103 104 105 106
## 0.16732675 0.13693850 0.09030345 0.02903775 0.05785108 0.02620532 0.14163626
## 108 109 110 111 112 113 116
## 0.10589162 0.10474356 0.12618014 0.10821691 0.09923436 0.04187571 0.08444247
## 117 118 119 121 123 125 126
## 0.06096385 0.19876132 0.20908513 0.04354104 0.16419450 0.03814256 0.06491885
## 129 130 131 132 133 135 136
## 0.05091989 0.07895804 0.07922781 0.16153057 0.05495068 0.13704234 0.07088912
## 137 138 140 141 142 144 145
## 0.07967540 0.05990466 0.05660830 0.05906771 0.11215440 0.05021665 0.08003167
## 146 148 149
## 0.09749905 0.09069640 0.08635798
##
## $`3`
## 1 2 3 4 5 6
## 0.015080735 0.062129555 0.040492882 0.047626095 0.026933024 0.078655020
## 7 8 9 10 11 12
## 0.012408029 0.009415575 0.076230585 0.051575979 0.039781033 0.021162153
## 13 14 15 16 17 18
## 0.065206736 0.082751657 0.090118290 0.140693907 0.074606998 0.015612658
## 19 20 21 22 23 24
## 0.063558426 0.053973538 0.034859485 0.044981629 0.070731455 0.052981321
## 25 26 27 28 29 30
## 0.063042584 0.064577749 0.029810284 0.013358480 0.011731694 0.038635925
## 31 32 33 34 35 36
## 0.051258449 0.023010202 0.098389480 0.110835786 0.047626095 0.050049771
## 37 38 39 40 41 42
## 0.026164316 0.033424908 0.065705589 0.009415575 0.026474478 0.162978902
## 43 44 45 46 47 48
## 0.040492882 0.055244170 0.084990276 0.062260808 0.057061370 0.034387943
## 49 50
## 0.039781033 0.021354157
Besides, we have already seen the function
cutArrayInCluLike()
in section Working with Arrays ‘Working with
Arrays’.
In some some circumstances clusters/groups with very vew individuals
are not productive during further evalualtions (in particular in the
context of interaction-netwoks). The function rmOrphans()
allows identifying and modifying cluster- or group-assignments of such
very small groups (‘orphans’).
In the example below a vector of cluster-assignments (‘x’) is treated by different options to remove orphans :
x=c(3:1,3:4,4:6,5:3)
cbind(x, def=rmOrphans(x), assign1=rmOrphans(x, reassign=TRUE),
assign1=rmOrphans(x, minN=0.2, reassign=TRUE) )
## x def assign1 assign1
## [1,] 3 3 3 3
## [2,] 2 NA 3 3
## [3,] 1 NA 3 3
## [4,] 3 3 3 3
## [5,] 4 4 4 4
## [6,] 4 4 4 4
## [7,] 5 5 5 4
## [8,] 6 NA 5 4
## [9,] 5 5 5 4
## [10,] 4 4 4 4
## [11,] 3 3 3 3
When interogating network-databases (like String for proteins or the
coexpressionDB for gene co-expression) typically a (semi-)quantitatve
value is supplied with the connection of node ‘A’ to node ‘B’.
In many cases, it may be useful to filter the initial query-output to
retain only strong interactions. Furthermore, it may be of interest to
expand such networks by nodes allowing to (further) inter-connect
initial query-nodes (so called ‘Sandwich’ nodes as they are in the
middle of initial nodes), for such nodes a separate (eg even more
stringent) threshold can be applied.
Here let’s suppose nodes have 3-digit names (ie numbers). 7 nodes of an initial query gave 1 to 7 conected nodes, the results are presented as list of data.frames where the 1st column is the connected node and the 2nd column the quality score of the connection (edge). Furthemore, let’s assume that here lower scores are better.
lst2 <- list('121'=data.frame(ID=as.character(c(141,221,228,229,449)),11:15),
'131'=data.frame(ID=as.character(c(228,331,332,333,339)),11:15),
'141'=data.frame(ID=as.character(c(121,151,229,339,441,442,449)),c(11:17)),
'151'=data.frame(ID=as.character(c(449,141,551,552)),11:14),
'161'=data.frame(ID=as.character(171),11),
'171'=data.frame(ID=as.character(161),11),
'181'=data.frame(ID=as.character(881:882),11:12) )
Now, we’d like to keep the core network consisting of all (dirctly) interconnected nodes with scores below 20 :
(nw1 <- filterNetw(lst2, limInt=20, sandwLim=NULL, remOrphans=FALSE))
## filterNetw : Invalid entry for 'filtCol': should be integer (of length=1) to designate which column to use or column-name; setting to 2
## filterNetw : 2 element(s) had no data remaining after filtering ...
## filterNetw : .filterNetw : Removing 3 (reverse) redundant mappings
## Node1 Node2 edgeScore toSandw
## 1 121 141 11 FALSE
## 2 141 151 12 FALSE
## 3 161 171 11 FALSE
In the resulting output the 1st column now represents the query-nodes, the 2nd column all connected nodes based on filtering scores for edges, and the 3rd colum the score for the edges.
Let’s also remove all nodes not connected to a backbone at least 3 nodes long, ie remove orphan pairs of nodes :
(nw2 <- filterNetw(lst2, limInt=20, sandwLim=NULL, remOrphans=TRUE))
## filterNetw : Invalid entry for 'filtCol': should be integer (of length=1) to designate which column to use or column-name; setting to 2
## filterNetw : 2 element(s) had no data remaining after filtering ...
## filterNetw : .filterNetw : Removing 3 (reverse) redundant mappings
## Node1 Node2 edgeScore toSandw
## 1 121 141 11 FALSE
## 2 141 151 12 FALSE
If you want to expand this network by nodes allowing to further interconnect the nodes from above, we can add all ‘sandwich’ nodes (let’s use a threshold of inferior/equal to 14 which will use only the better ‘sandwich’-edges) :
(nw3 <- filterNetw(lst2, limInt=20, sandwLim=14, remOrphans=TRUE))
## filterNetw : Invalid entry for 'filtCol': should be integer (of length=1) to designate which column to use or column-name; setting to 2
## filterNetw : 1 element(s) had no data remaining after filtering ...
## filterNetw : .filterNetw : Removing 3 (reverse) redundant mappings
## Node1 Node2 edgeScore toSandw
## 1 121 141 11 FALSE
## 2 121 228 13 TRUE
## 3 121 229 14 TRUE
## 4 131 228 11 TRUE
## 5 141 151 12 FALSE
## 6 141 229 13 TRUE
Many times networks get created from pairs of nodes. One way to represent the full network is via propensisty matrixes. Several advanced tools and packages rather accept such propensisty matrixes as input. Here, it is assumed that each line of the input represents a separate pair of nodes connected by an edge.
pairs3L <- matrix(LETTERS[c(1,3,3, 2,2,1)], ncol=2) # loop of 3
(netw13pr <- pairsAsPropensMatr(pairs3L)) # as prop matr
## 1 2 3
## 1 0 1 1
## 2 1 0 1
## 3 1 1 0
path1 <- matrix(c(17,19,18,17, 4,4,2,3), ncol=2,
dimnames=list(c("A/B/C/D","A/B/G/D","A/H","A/H/I"), c("sumLen","n")))
contribToContigPerFrag(path1)
## sumLe n.frag len.rat
## A 19 4 1.000
## B 19 4 1.000
## C 17 4 0.895
## D 19 4 1.000
## G 19 4 1.000
## H 18 2 0.947
## I 17 3 0.895
If you have a set of fragments from a common ancestor and the fragment’s start- and end-sites are marked by index-positions (integers), you can make a simple graphical display :
frag1 <- cbind(beg=c(2,3,7,13,13,15,7,9,7, 3,3,5), end=c(6,12,8,18,20,20,19,12,12, 4,5,7))
rownames(frag1) <- letters[1:nrow(frag1)]
simpleFragFig(frag1)
Now we can make a matrix telling if some fragments do start or end at exactely the same position.
countSameStartEnd(frag1)
## beg end beg.n beg.rat end.n end.rat
## a 2 6 NA NA NA NA
## b 3 12 3 0.2500 3 0.2500
## c 7 8 3 0.2500 NA NA
## d 13 18 2 0.1667 NA NA
## e 13 20 2 0.1667 2 0.1667
## f 15 20 NA NA 2 0.1667
## g 7 19 3 0.2500 NA NA
## h 9 12 NA NA 3 0.2500
## i 7 12 3 0.2500 3 0.2500
## j 3 4 3 0.2500 NA NA
## k 3 5 3 0.2500 NA NA
## l 5 7 NA NA NA NA
The function pasteC()
allows adding quotes and
separating the last element by specific text (eg ‘and’).
pasteC(1:4)
## [1] "1, 2, 3 and 4"
pasteC(letters[1:4],quoteC="'")
## [1] "'a', 'b', 'c' and 'd'"
By default most color-gradients end with a color very close to the beginning.
set.seed(2015); dat1 <- round(runif(15),2)
plot(1:15, dat1, pch=16, cex=2, las=1, col=colorAccording2(dat1),
main="Color gradient according to value in y")
# Here we modify the span of the color gradient
plot(1:15, dat1, pch=16, cex=2, las=1,
col=colorAccording2(dat1, nStartO=0, nEndO=4, revCol=TRUE), main="blue to red")
# It is also possible to work with scales of transparency
plot(1:9, pch=3, las=1)
points(1:9, 1:9, col=transpGraySca(st=0, en=0.8, nSt=9,trans=0.3), cex=42, pch=16)
For this purpose you may use convColorToTransp
.
col0 <- c("#998FCC","#5AC3BA","#CBD34E","#FF7D73")
col1 <- convColorToTransp(col0,alph=0.7)
layout(1:2)
pie(rep(1,length(col0)), col=col0, main="no transparency")
pie(rep(1,length(col1)), col=col1, main="new transparency")
There are many ways of creating reports. If you want simply to
combine a few plots into a pdf, the function tableToPlot()
may be helpful to add a small table (eg overview of points/samples/files
used in other plots of the same pdf). This function prints tables in the
current graphical output/window (which may by a pdf-device).
Many times it may be useful to add the date to filenames when saving data or plots as files. The built-in functions date(), Sys.Date() and Sys.Time() are a good way to start.
Generally I like to use abbreviated month-names since the order of writing the month is different in Europe compared to the USA. So, this may help avoiding mis-interpreting dates instead of writing the number of the Month. For example, 2021-03-05 means in Europe March 5th while in other places it means May 3rd.
You may also look at the standardized format for numeric dates norm ISO 8601, which matches to Sys.Date() (from package base) and sysDate(style=“univ5”) (package wrMisc) .
The R-functions mentioned above (date(),
Sys.Date(), from the base package) use local language
settings. For using English languange names you may use the function
sysDate
from this package (ie wrProteo). It allows
producing compact versions of current the date, independent to
local language settings (or not -if you prefer), ie
locale-specific, (yes, in some languages - like French - the first 3
letters of the month may give ambiguous results !) and to avoid white
space ’ ’ (which I prefer to avoid in file-names). Please look at the
function’s help-page for all available options.
## To get started
Sys.Date()
## [1] "2024-08-20"
## Compact English names (in European order), no matter what your local settings are :
sysDate()
## [1] "20aug24"
The table below shows a number of options to write the date in English or using local month-names :
tabD <- cbind(paste0("univ",1:6), c(sysDate(style="univ1"), sysDate(style="univ2"),
sysDate(style="univ3"), sysDate(style="univ4"), as.character(sysDate(style="univ5")),
sysDate(style="univ6")), paste0(" local",1:6),
c(sysDate(style="local1"), sysDate(style="local2"), sysDate(style="local3"),
sysDate(style="local4"), sysDate(style="local5"), sysDate(style="local6")))
knitr::kable(tabD, caption="Various ways of writing current date")
univ1 | 20aug24 | local1 | 20aoû24 |
univ2 | 20Aug24 | local2 | 20Aoû24 |
univ3 | 20August2024 | local3 | 20Août2024 |
univ4 | 20august2024 | local4 | 20août2024 |
univ5 | 2024-08-20 | local5 | 20-août-2024 |
univ6 | 2024-233 | local6 | 2024août20 |
## R version 4.4.1 (2024-06-14 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 10 x64 (build 19045)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=C LC_CTYPE=French_France.utf8
## [3] LC_MONETARY=French_France.utf8 LC_NUMERIC=C
## [5] LC_TIME=French_France.utf8
##
## time zone: Europe/Paris
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] limma_3.60.3 knitr_1.47 wrMisc_1.15.2
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.5 jsonlite_1.8.8 dplyr_1.1.4 compiler_4.4.0
## [5] highr_0.11 Rcpp_1.0.12 tidyselect_1.2.1 stringr_1.5.1
## [9] jquerylib_0.1.4 splines_4.4.0 scales_1.3.0 yaml_2.3.8
## [13] fastmap_1.2.0 statmod_1.5.0 plyr_1.8.9 ggplot2_3.5.1
## [17] R6_2.5.1 generics_0.1.3 BBmisc_1.13 fdrtool_1.2.17
## [21] backports_1.5.0 checkmate_2.3.1 tibble_3.2.1 munsell_0.5.1
## [25] bslib_0.7.0 pillar_1.9.0 rlang_1.1.4 utf8_1.2.4
## [29] stringi_1.8.4 cachem_1.1.0 xfun_0.45 sass_0.4.9
## [33] wrGraph_1.3.7 cli_3.6.3 magrittr_2.0.3 digest_0.6.36
## [37] grid_4.4.0 lifecycle_1.0.4 vctrs_0.6.5 qvalue_2.36.0
## [41] evaluate_0.24.0 glue_1.7.0 data.table_1.15.4 fansi_1.0.6
## [45] colorspace_2.1-0 reshape2_1.4.4 rmarkdown_2.27 tools_4.4.0
## [49] pkgconfig_2.0.3 htmltools_0.5.8.1
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.