Getting started with wrMisc

Wolfgang Raffelsberger

2020-04-22

Introduction

This package contains a collection of various low-level tools which may be of interest for general re-use. These functions were accumulated over a number of years when treating high-throughput data from biomedical applications. In addition, these functions are used/integrated in more specialized functions dedicated to specific applications in other packages (eg wrProteo). To get started, we need to load the package “wrMisc” available from CRAN.

library(wrMisc)

Speed optimized functions in the package wrMisc

In high-throughput experiments in biology (like transcriptomics, proteomics etc…) many different features get measured a number of times (in different samples, like patients or time-points along the evolution of a disease). The resulting data typically contain many (independent) rows (eg >1000 different genes or proteins whose abundance was measured) and much fewer columns that may get further organized in groups of replicates. As R is a versatile language, multiple options exist for assessing the global characteristics of such data, some of them being more efficient from a computational point of view. In order to allow fast treatment of very large data-sets some tools have been re-designed for optimal performance.

Assessing basic information about variability (for matrix)

Many measurement techniques applied in a high-throughput manner suffer from limited precision. This means that the same measurement taken twice in a row (ie repeated on the same subject) will very likely not give an identical result. For this reason it is common practice to make replicate measurements to i) estimate mean (ie representative) values and ii) assess the factors contributing to the variability observed. Briefly, technical replicates represent the case where multiple read-outs of the very same sample are generated and the resulting variability is associated with technical issues during the process of taking measures. Biological replicates represent independent samples and therefore reflect the variability a given parameter may have in a certain population of individuals. With the tools presented here, both technical and biological replicates can be dealt with. In several cases the interpretation of the resulting numbers should consider the experimental setup, though.

Let’s make a simple matrix as toy data:

grp1 <- rep(LETTERS[1:3],c(3,4,3))
sampNa1 <- paste0(grp1,c(1:3,1:4,1:3))
set.seed(2016); dat1 <- matrix(round(c(runif(50000)+rep(1:1000,50)),3),ncol=10,
  dimnames=list(NULL,sampNa1))
dim(dat1)
## [1] 5000   10
head(dat1)
##         A1    A2    A3    B1    B2    B3    B4    C1    C2    C3
## [1,] 1.180 1.640 1.199 1.118 1.425 1.745 1.253 1.554 1.303 1.856
## [2,] 2.143 2.237 2.730 2.693 2.603 2.293 2.542 2.452 2.148 2.776
## [3,] 3.842 3.155 3.191 3.520 3.686 3.408 3.409 3.871 3.345 3.588
## [4,] 4.134 4.394 4.982 4.320 4.380 4.888 4.965 4.462 4.250 4.647
## [5,] 5.478 5.472 5.488 5.570 5.626 5.765 5.551 5.016 5.659 5.139
## [6,] 6.121 6.294 6.718 6.890 6.999 6.316 6.542 6.119 6.763 6.487

Now let’s estimate the standard deviation (sd) for every row:

head(rowSds(dat1))
## [1] 0.2583693 0.2426026 0.2477899 0.3089102 0.2307722 0.3124493
system.time(sd1 <- rowSds(dat1))
##    user  system elapsed 
##       0       0       0
system.time(sd2 <- apply(dat1,1,sd))
##    user  system elapsed 
##    0.06    0.01    0.08

On most systems the equivalent calculation using apply() will run much slower.
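
The speed gain comes from replacing one function call per row by a few whole-matrix operations. Below a minimal sketch of this vectorized idiom (for illustration only; this is not necessarily the exact implementation of rowSds(), and it is only valid in absence of NAs):

rowSdFast <- function(x) sqrt(rowSums((x - rowMeans(x))^2) / (ncol(x) - 1))  # sample sd per row
head(rowSdFast(dat1))     # should agree with rowSds(dat1) up to rounding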

Note, there may be a minor difference due to floating-point rounding :

table(round(sd1,13)==round(sd2,13))
## 
## FALSE  TRUE 
##     1  4999

Similarly we can easily calculate the CV (coefficient of variation, ie sd/mean) for every row :

system.time(cv1 <- rowCVs(dat1))
##    user  system elapsed 
##       0       0       0
system.time(cv2 <- apply(dat1,1,sd)/rowMeans(dat1))
##    user  system elapsed 
##    0.08    0.00    0.08
# typically the calculation using rowCVs is much faster
head(cv1)
## [1] 0.18101959 0.09855083 0.07076678 0.06800894 0.04213940 0.04788568
head(cv2)
## [1] 0.18101959 0.09855083 0.07076678 0.06800894 0.04213940 0.04788568

Note, these calculations will be very efficient as long as the number of rows is >> the number of columns.
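
If in doubt for a given data-shape, a quick test on a wide toy matrix (few rows, many columns) can be instructive. This is just a sketch; absolute timings vary between systems (output not shown):

datWide <- matrix(round(runif(50000),3), nrow=10)   # only 10 rows but 5000 columns
system.time(sdW1 <- rowSds(datWide))
system.time(sdW2 <- apply(datWide, 1, sd))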

Now, let’s assume our data contains 3 initial samples, each measured as several replicates (the grouping was already defined in grp1). Similarly, we can also calculate the sd or CV for each line while splitting into groups of replicates:

# we already defined the grouping :
grp1
##  [1] "A" "A" "A" "B" "B" "B" "B" "C" "C" "C"
system.time(sd1Gr <- rowGrpSds(dat1,grp1))
##    user  system elapsed 
##       0       0       0
head(sd1Gr)
##                A          B         C
## [1,] 0.260269732 0.27074758 0.2768917
## [2,] 0.315291928 0.17144557 0.3140531
## [3,] 0.386666523 0.13115989 0.2632534
## [4,] 0.434443706 0.33521970 0.1986530
## [5,] 0.008082904 0.09672297 0.3413156
## [6,] 0.307168249 0.31475851 0.3230934
# Let's check the results of the first line :
sd1Gr[1,] == c(sd(dat1[1,1:3]),sd(dat1[1,4:7]),sd(dat1[1,8:10]))
##    A    B    C 
## TRUE TRUE TRUE
# The CV :
system.time(cv1Gr <- rowGrpCV(dat1,grp1))
##    user  system elapsed 
##       0       0       0
head(cv1Gr)
##                A          B          C
## [1,] 0.194279471 0.19545033 0.17625186
## [2,] 0.133034569 0.06769147 0.12773308
## [3,] 0.113859400 0.03741279 0.07309886
## [4,] 0.096471585 0.07227288 0.04461104
## [5,] 0.001475162 0.01718603 0.06474939
## [6,] 0.048163108 0.04707197 0.05004286

Fast NA-omit for very large objects

The function na.omit() from the package stats also keeps a trace of all omitted instances. This can be penalizing in terms of memory usage when handling very large vectors with a high content of NAs. If you don’t need to document precisely which elements got eliminated, the function naOmit() may offer smoother functioning for very large objects.

aA <- c(11:13,NA,10,NA)
 
str(naOmit(aA))
##  num [1:4] 11 12 13 10
str(na.omit(aA))
##  num [1:4] 11 12 13 10
##  - attr(*, "na.action")= 'omit' int [1:2] 4 6
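
The memory argument can be illustrated using object.size() : the 'na.action' attribute kept by na.omit() stores one integer index per omitted element. A small sketch (exact sizes depend on the platform):

bigV <- rep(c(1:5,NA), 2e5)     # 1.2 million elements, 1/6 of them NA
as.numeric(object.size(na.omit(bigV))) - as.numeric(object.size(naOmit(bigV)))  # bytes saved by naOmit()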

Minimum distance/difference between values

If you need to find the closest neighbour(s) of the values in a numeric vector, the function minDiff() will tell you the distance (as “dif”, “ppm” ie parts per million, or “ratio”) and index (“best”) of the closest neighbour. In case of multiple equally short distances the index of the first one is reported, and the column “nbest” will display a value >1.

set.seed(2017); aa <- 10*c(0.1+round(runif(20),2),0.53,0.53)
head(aa)
## [1] 10.2  6.4  5.7  3.9  8.7  8.7
minDiff(aa,ppm=FALSE)
##       index value  dif   rat ncur nbest best
##  [1,]     1  10.2 -0.2 0.981    1     1   19
##  [2,]     2   6.4  0.4 1.070    1     1   15
##  [3,]     3   5.7  0.3 0.950    2     1   15
##  [4,]     4   3.9  0.2 1.050    1     1   10
##  [5,]     5   8.7  0.5 1.060    2     1   18
##  [6,]     6   8.7  0.5 1.060    2     1   18
##  [7,]     7   1.4  0.1 1.080    1     1   13
##  [8,]     8   5.3  0.3 1.060    4     1   17
##  [9,]     9   5.7  0.3 0.950    2     1   15
## [10,]    10   3.7 -0.2 0.949    1     1    4
## [11,]    11   7.7 -0.5 0.939    1     1   18
## [12,]    12   1.0 -0.3 0.769    1     1   13
## [13,]    13   1.3 -0.1 0.929    1     1    7
## [14,]    14   5.3  0.3 1.060    4     1   17
## [15,]    15   6.0  0.3 1.050    1     2    9
## [16,]    16   4.9 -0.1 0.980    1     1   17
## [17,]    17   5.0  0.1 1.020    1     1   16
## [18,]    18   8.2  0.5 1.060    1     1   11
## [19,]    19  10.4  0.2 1.020    1     1    1
## [20,]    20   9.3  0.6 1.070    1     2    6
## [21,]    21   5.3  0.3 1.060    4     1   17
## [22,]    22   5.3  0.3 1.060    4     1   17

Working with lists (and lists of lists)

Partial unlist

If input from different places gets collected and combined into a list, this may give a collection of different types of data. If you would like to preserve multi-column elements as they are (and only flatten the list by one level):

bb <- list(fa=gl(2,2),c=31:33,L2=matrix(21:28,ncol=2),li=list(li1=11:14,li2=data.frame(41:44)))
partUnlist(lapply(bb,.asDF2))
## $fa
##   V1
## 1  1
## 2  1
## 3  2
## 4  2
## 
## $c
##   V1
## 1 31
## 2 32
## 3 33
## 
## $L2
##   V1 V2
## 1 21 25
## 2 22 26
## 3 23 27
## 4 24 28
## 
## $li1
## [1] 11 12 13 14
## 
## $li2
##   X41.44
## 1     41
## 2     42
## 3     43
## 4     44

This won’t be possible using unlist() :

head(unlist(bb,recursive=FALSE))
## $fa1
## [1] 1
## 
## $fa2
## [1] 1
## 
## $fa3
## [1] 2
## 
## $fa4
## [1] 2
## 
## $c1
## [1] 31
## 
## $c2
## [1] 32

To make such data uniform, ie to obtain a list with a simple vector for each list-element, the function asSepList() provides help :

bb <- list(fa=gl(2,2),c=31:33,L2=matrix(21:28,ncol=2),li=list(li1=11:14,li2=data.frame(41:44)))
asSepList(bb)
## $fa
## [1] 1 1 2 2
## 
## $c
## [1] 31 32 33
## 
## $li1
## [1] 11 12 13 14
## 
## $li2
## [1] 41 42 43 44
## 
## $V1
## [1] 21 22 23 24
## 
## $V2
## [1] 25 26 27 28

rbind on lists

When a matrix (or data.frame) gets split into a list, like in the example below using by(), such lists can get joined using lrbind() in an rbind-like fashion.

dat2 <- matrix(11:34,ncol=3,dimnames=list(letters[1:8],colnames=LETTERS[1:3]))
lst2 <- by(dat2,rep(1:3,c(3,2,3)),as.matrix)
lst2
## INDICES: 1
##    A  B  C
## a 11 19 27
## b 12 20 28
## c 13 21 29
## ------------------------------------------------------------ 
## INDICES: 2
##    A  B  C
## d 14 22 30
## e 15 23 31
## ------------------------------------------------------------ 
## INDICES: 3
##    A  B  C
## f 16 24 32
## g 17 25 33
## h 18 26 34
# join list-elements into single matrix
lrbind(lst2)
##    A  B  C
## a 11 19 27
## b 12 20 28
## c 13 21 29
## d 14 22 30
## e 15 23 31
## f 16 24 32
## g 17 25 33
## h 18 26 34

Fuse content of list-elements with redundant (duplicated) names

When list-elements have the same name, their content (of named numeric or character vectors) may get fused according to the names of the list-elements :

val1 <- 10 +1:26
names(val1) <- letters
(lst1 <- list(c=val1[3:6],a=val1[1:3],b=val1[2:3],a=val1[12],c=val1[13]))
## $c
##  c  d  e  f 
## 13 14 15 16 
## 
## $a
##  a  b  c 
## 11 12 13 
## 
## $b
##  b  c 
## 12 13 
## 
## $a
##  l 
## 22 
## 
## $c
##  m 
## 23
## here the names 'a' and 'c' appear twice :
names(lst1)
## [1] "c" "a" "b" "a" "c"
## now, let's fuse all 'a' and 'c'
fuseCommonListElem(lst1)
## $c
##  c  d  e  f  m 
## 13 14 15 16 23 
## 
## $a
##  a  b  c  l 
## 11 12 13 22 
## 
## $b
##  b  c 
## 12 13

Replacements in list

The function listBatchReplace() works similarly to sub() and allows searching & replacing exact matches to a character string along all elements of a list.

(lst1 <- list(aa=1:4,bb=c("abc","efg","abhh","effge"),cc=c("abdc","efg","efgh")))
## $aa
## [1] 1 2 3 4
## 
## $bb
## [1] "abc"   "efg"   "abhh"  "effge"
## 
## $cc
## [1] "abdc" "efg"  "efgh"
listBatchReplace(lst1,search="efg",repl="EFG",sil=FALSE)
## $aa
## [1] "1" "2" "3" "4"
## 
## $bb
## [1] "abc"   "EFG"   "abhh"  "effge"
## 
## $cc
## [1] "abdc" "EFG"  "efgh"

Organize values into list and sort by names

Named numeric or character vectors can be organized into lists using listGroupsByNames(), based on their names (only the part of the name before any extension starting with a point gets considered). Of course, other separators may be defined using the argument sep.

ser1 <- 1:7; names(ser1) <- c("AA","BB","AA.1","CC","AA.b","BB.e","A")

listGroupsByNames(ser1)
## $AA
##   AA AA.1 AA.b 
##    1    3    5 
## 
## $BB
##   BB BB.e 
##    2    6 
## 
## $CC
## CC 
##  4 
## 
## $A
## A 
## 7
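
The same grouping logic applies with other separators; here a small sketch using the argument sep mentioned above (output not shown):

ser2 <- 1:4; names(ser2) <- c("AA_1","AA_2","BB_1","CC")
listGroupsByNames(ser2, sep="_")     # expect groups 'AA' (2 elements), 'BB' and 'CC'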

If no names are present, the content of the vector itself will be used as name :

listGroupsByNames((1:10)/5)
##  -> listGroupsByNames :  no names found in 'x' !!
## $`0`
##   0   0   0   0 
## 0.2 0.4 0.6 0.8 
## 
## $`1`
##   1   1   1   1   1 
## 1.0 1.2 1.4 1.6 1.8 
## 
## $`2`
## 2 
## 2

Batch-filter list-elements

In the spirit of object-oriented programming, several methods produce results integrated into lists or S3-objects (eg limma). The function filterList() aims at facilitating the filtering of all elements of such lists or S3-objects. List-elements with an inappropriate number of lines will be ignored.

set.seed(2020); dat1 <- round(runif(80),2)
list1 <- list(m1=matrix(dat1[1:40],ncol=8),m2=matrix(dat1[41:80],ncol=8),other=letters[1:8])
rownames(list1$m1) <- rownames(list1$m2) <- paste0("line",1:5)
# Note: the list-element list1$other has a length different to that of filt. Thus, it won't get filtered.
filterList(list1, list1$m1[,1] >0.4)       # filter according to 1st column of $m1 ...
## $m1
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## line1 0.65 0.07 0.76 0.54 0.20 0.17 0.96 0.37
## line3 0.62 0.39 0.83 0.65 0.82 0.75 0.96 0.93
## line4 0.48 0.00 0.42 0.55 0.94 0.45 0.95 0.52
## 
## $m2
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## line1 0.99 0.57 0.58 0.00 0.21 0.61 0.61 0.30
## line3 0.86 0.70 0.90 0.22 0.23 0.58 0.39 0.06
## line4 0.88 0.80 0.52 0.54 0.42 0.65 0.47 0.67
## 
## $other
## [1] "a" "b" "c" "d" "e" "f" "g" "h"
filterList(list1, list1$m1 >0.4) 
## $m1
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## line1 0.65 0.07 0.76 0.54 0.20 0.17 0.96 0.37
## line3 0.62 0.39 0.83 0.65 0.82 0.75 0.96 0.93
## line4 0.48 0.00 0.42 0.55 0.94 0.45 0.95 0.52
## line5 0.14 0.62 0.41 0.27 0.88 0.56 0.00 0.22
## 
## $m2
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## line1 0.99 0.57 0.58 0.00 0.21 0.61 0.61 0.30
## line3 0.86 0.70 0.90 0.22 0.23 0.58 0.39 0.06
## line4 0.88 0.80 0.52 0.54 0.42 0.65 0.47 0.67
## line5 0.62 0.17 0.83 0.49 0.86 0.17 0.53 0.72
## 
## $other
## [1] "a" "b" "c" "d" "e" "f" "g" "h"

Transform columns of matrix to list of vectors

(mat1 <- matrix(1:12,ncol=3,dimnames=list(letters[1:4],LETTERS[1:3])))
##   A B  C
## a 1 5  9
## b 2 6 10
## c 3 7 11
## d 4 8 12
str(matr2list(mat1))
## List of 3
##  $ A: Named num [1:4] 1 2 3 4
##   ..- attr(*, "names")= chr [1:4] "A.a" "A.b" "A.c" "A.d"
##  $ B: Named num [1:4] 5 6 7 8
##   ..- attr(*, "names")= chr [1:4] "B.a" "B.b" "B.c" "B.d"
##  $ C: Named num [1:4] 9 10 11 12
##   ..- attr(*, "names")= chr [1:4] "C.a" "C.b" "C.c" "C.d"

Working with arrays

Let’s get started with a little toy-array:

(arr1 <- array(c(6:4,4:24),dim=c(4,3,2),dimnames=list(c(LETTERS[1:4]),
  paste("col",1:3,sep=""),c("ch1","ch2"))))
## , , ch1
## 
##   col1 col2 col3
## A    6    5    9
## B    5    6   10
## C    4    7   11
## D    4    8   12
## 
## , , ch2
## 
##   col1 col2 col3
## A   13   17   21
## B   14   18   22
## C   15   19   23
## D   16   20   24

CV (coefficient of variation) with arrays

Now we can obtain the CV (coefficient of variation) by splitting along the 3rd dimension (ie this is equivalent to an apply along the 3rd dimension):

arrayCV(arr1)
##         ch1       ch2
## A 0.3122499 0.2352941
## B 0.3779645 0.2222222
## C 0.4788934 0.2105263
## D 0.5000000 0.2000000
# this is equivalent to
cbind(rowCVs(arr1[,,1]), rowCVs(arr1[,,2]))
##        [,1]      [,2]
## A 0.3122499 0.2352941
## B 0.3779645 0.2222222
## C 0.4788934 0.2105263
## D 0.5000000 0.2000000

Similarly we can split along any other dimension, eg the 2nd dimension :

arrayCV(arr1,byDim=2)
##        col1      col2      col3
## A 0.5210260 0.7713892 0.5656854
## B 0.6698906 0.7071068 0.5303301
## C 0.8187552 0.6527140 0.4991342
## D 0.8485281 0.6060915 0.4714045

Slice 3-dim array in list of matrixes (or arrays)

This procedure is similar to (re-)organizing an initial array into clusters; here we split along a user-defined factor/vector. If a clustering-algorithm produces the cluster assignments, this function can be used to organize the input data accordingly.

cutArrayInCluLike(arr1,cluOrg=c(2,1,2,1))
## $`2`
## , , ch1
## 
##   col1 col2 col3
## A    6    5    9
## C    4    7   11
## 
## , , ch2
## 
##   col1 col2 col3
## A   13   17   21
## C   15   19   23
## 
## 
## $`1`
## , , ch1
## 
##   col1 col2 col3
## B    5    6   10
## D    4    8   12
## 
## , , ch2
## 
##   col1 col2 col3
## B   14   18   22
## D   16   20   24

Let’s cut by filtering along the 3rd dimension for all lines where column ‘col2’ is >7, and then display only the content of columns ‘col1’ and ‘col2’ :

filt3dimArr(arr1,displCrit=c("col1","col2"),filtCrit="col2",filtVal=7,filtTy=">")
## [[1]]
## col1 col2 
##    4    8 
## 
## [[2]]
##   col1 col2
## A   13   17
## B   14   18
## C   15   19
## D   16   20

Working with redundant data

Please note that there are two ways of interpreting the term ‘unique’ :

In some applications (eg proteomics) initial identifiers may occur multiple times and we frequently need to isolate events/values that occur only once; this is the first meaning of ‘unique’ (the second being one representative copy of every distinct value, as returned by unique()). This package provides functions to easily distinguish values occurring just once from those occurring multiple times. Furthermore, there are functions to rename/remove/combine replicated elements.
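
The distinction can be sketched with base R alone : unique() keeps one copy of every distinct value, while combining duplicated() from both directions isolates the values occurring exactly once.

vv <- c("a","b","b","c")
unique(vv)                                             # "a" "b" "c" : one copy of each
vv[!(duplicated(vv) | duplicated(vv,fromLast=TRUE))]   # "a" "c" : values occurring only once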

Identify what is repeated (and where repeats do occur)

aa <- c("li0","n",NA,NA,rep(c("li2","li3"),2),rep("n",4))

From the package base : The function table() is very useful to get some insights when working with smaller objects, but may be slow to handle very large objects. As mentioned, unique() will make everything unique, and afterwards you won’t know any more which values occurred just once in the first place ! The function duplicated() (from package base) helps us getting the information which elements are repeated.

table(aa)
## aa
## li0 li2 li3   n 
##   1   2   2   5
unique(aa) 
## [1] "li0" "n"   NA    "li2" "li3"
duplicated(aa,fromLast=FALSE)
##  [1] FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

findRepeated() (from this package) will return the position/index (and content/value) of repeated elements :

aa <- c(11:16,NA,14:12,NA,14)
names(aa) <- letters[1:length(aa)]
aa
##  a  b  c  d  e  f  g  h  i  j  k  l 
## 11 12 13 14 15 16 NA 14 13 12 NA 14
findRepeated(aa) 
## $`12`
## [1]  2 10
## 
## $`13`
## [1] 3 9
## 
## $`14`
## [1]  4  8 12

firstOfRepeated() tells the index of the first instance of repeated elements, which elements you need to keep to make the vector ‘unique’, and which elements get stripped off when making it unique. Please note that NAs (no matter if they occur once or multiple times) are automatically placed in the part suggested to be removed.

firstOfRepeated(aa)
## $indRepeated
## 12 13 14 
##  2  3  4 
## 
## $indUniq
## a b c d e f 
## 1 2 3 4 5 6 
## 
## $indRedund
##  g  h  i  j  k  l 
##  7  8  9 10 11 12
aa[firstOfRepeated(aa)$indUniq]          # only unique with names
##  a  b  c  d  e  f 
## 11 12 13 14 15 16
unique(aa)                               # unique() does not return any names !
## [1] 11 12 13 14 15 16 NA

Correct vector to unique (while maintaining the original vector length)

If necessary a counter will be added to non-unique entries; thus no individual values get eliminated and the length and order of the resulting object remain the same.

This is of importance when assigning rownames to a data.frame : Assigning redundant values/text as rownames of a data.frame will result in an error !

correctToUnique(aa)
##      a      b      c      d      e      f      g      h      i      j      k 
##   "11" "12_1" "13_1" "14_1"   "15"   "16" "NA_1" "14_2" "13_2" "12_2" "NA_2" 
##      l 
## "14_3"
correctToUnique(aa,sep=".",NAenu=FALSE)          # keep NAs (ie without transforming to character)
##      a      b      c      d      e      f      g      h      i      j      k 
##   "11" "12.1" "13.1" "14.1"   "15"   "16"     NA "14.2" "13.2" "12.2"     NA 
##      l 
## "14.3"

Make unique and mark any duplicated (ie ambiguous) elements (by changing their names)

First, the truly unique values are reported and then the first occurrence of each repeated element is given; NA instances get ignored.

unique(aa)                                    # names are lost
## [1] 11 12 13 14 15 16 NA
nonAmbiguousNum(aa)
##     a     e     f amb_b amb_c 
##    11    15    16    12    13
nonAmbiguousNum(aa,uniq=FALSE,asLi=TRUE)      # separate in list unique and repeated 
## $unique
##  a  e  f 
## 11 15 16 
## 
## $ambig
## amb_b amb_c 
##    12    13

Combine multiple matrixes where some column-names are the same

Here, it is supposed that you want to join 2 or more matrixes describing different properties of the same collection of individuals (as rows). Common column-names are interpreted as an indication that their respective content should be combined (either as average or as sum).

(ma1 <- matrix(1:6,ncol=3,dimnames=list(1:2,LETTERS[3:1])))
##   C B A
## 1 1 3 5
## 2 2 4 6
(ma2 <- matrix(11:16,ncol=3,dimnames=list(1:2,LETTERS[3:5])))
##    C  D  E
## 1 11 13 15
## 2 12 14 16
cbindNR(ma1,ma2,summarizeAs="mean")       # average of both columns 'C'
##  -> cbindNR :  treating 5 different (types of) columns : C B A D E
##  -> cbindNR :   sorting columns of output
##   A B C  D  E
## 1 5 3 6 13 15
## 2 6 4 7 14 16

Filter matrix to keep only first of repeated lines

This resembles the functioning of unique(), but applies to a user-specified column of the matrix.

(mat1 <- matrix(c(1:6,rep(1:3,1:3)),ncol=2,dimnames=list(letters[1:6],LETTERS[1:2])))
##   A B
## a 1 1
## b 2 2
## c 3 2
## d 4 3
## e 5 3
## f 6 3
firstLineOfDat(mat1, refCol=2)
##   A B
## a 1 1
## b 2 2
## d 4 3

This function was rather designed for dealing with character input; it allows concatenating all columns and removing redundant lines.

mat2 <- matrix(c("e","n","a","n","z","z","n","z","z","b", 
  "","n","c","n","","","n","","","z"),ncol=2)
firstOfRepLines(mat2,out="conc")
## [1] "e"  "nn" "ac" "z"  "bz"
# or as index :
firstOfRepLines(mat2)
## [1]  1  2  3  5 10

Filter to make unique based on one column and concatenate content of other user-specified column(s)

This function also includes a counter of the redundant instances encountered (for the 1st column specified)

(df1 <- data.frame(cbind(xA=letters[1:5],xB=c("h","h","f","e","f"),xC=LETTERS[1:5])))
##   xA xB xC
## 1  a  h  A
## 2  b  h  B
## 3  c  f  C
## 4  d  e  D
## 5  e  f  E
nonredDataFrame(df1,useCol=c("xB","xC")) 
##   xA xB xC nSamePep concID
## 1  a  h  A        2   C//E
## 3  c  f  C        2   A//B
## 4  d  e  D        1      D
# without counter or concatenating
df1[which(!duplicated(df1[,2])),]
##   xA xB xC
## 1  a  h  A
## 3  c  f  C
## 4  d  e  D
# or
df1[firstOfRepLines(df1,useCol=2),]
##   xA xB xC
## 1  a  h  A
## 3  c  f  C
## 4  d  e  D

Get first of repeated by column

mat2 <- cbind(no=as.character(1:20),seq=sample(LETTERS[1:15],20,repl=TRUE),
  ty=sample(c("full","Nter","inter"),20,repl=TRUE),ambig=rep(NA,20),seqNa=1:20)
(mat2uniq <- get1stOfRepeatedByCol(mat2,sortBy="seq",sortSupl="ty"))
##       no   seq ty      ambig  seqNa
##  [1,] "6"  "M" "Nter"  NA     "6"  
##  [2,] "11" "C" "inter" NA     "11" 
##  [3,] "12" "N" "Nter"  NA     "12" 
##  [4,] "17" "J" "full"  NA     "17" 
##  [5,] "18" "A" "full"  NA     "18" 
##  [6,] "19" "O" "Nter"  NA     "19" 
##  [7,] "7"  "B" "Nter"  "TRUE" "_7" 
##  [8,] "10" "D" "full"  "TRUE" "_10"
##  [9,] "8"  "E" "full"  "TRUE" "_8" 
## [10,] "9"  "F" "full"  "TRUE" "_9" 
## [11,] "13" "G" "Nter"  "TRUE" "_13"
## [12,] "3"  "H" "Nter"  "TRUE" "_3"
# the values from column 'seq' are indeed unique
table(mat2uniq[,"seq"])
## 
## A B C D E F G H J M N O 
## 1 1 1 1 1 1 1 1 1 1 1 1
# This will return all first repeated (may be >1), but without further sorting 
#  along column 'ty' nor marking in column 'ambig'.
mat2[which(duplicated(mat2[,2],fromLast=FALSE)),]
##      no   seq ty     ambig seqNa
## [1,] "5"  "H" "Nter" NA    "5"  
## [2,] "8"  "E" "full" NA    "8"  
## [3,] "9"  "F" "full" NA    "9"  
## [4,] "13" "G" "Nter" NA    "13" 
## [5,] "14" "D" "full" NA    "14" 
## [6,] "15" "D" "full" NA    "15" 
## [7,] "16" "B" "Nter" NA    "16" 
## [8,] "20" "F" "full" NA    "20"

Transform matrix to non-ambiguous matrix (in respect to given column)

nonAmbiguousMat(mat1,by=2)
##       A B
## 1     1 1
## amb_3 3 2
## amb_6 6 3

Here another example; ambiguous entries will be marked by an ’_’ :

set.seed(2017); mat3 <- matrix(c(1:100,round(rnorm(200),2)),ncol=3,
  dimnames=list(1:100,LETTERS[1:3]));
head(mat3U <- nonAmbiguousMat(mat3,by="B",na="_",uniqO=FALSE),n=15)
##      A     B     C
## 81  81 -2.59 -0.14
## 93  93 -2.02 -0.03
## 7    7 -1.96  0.52
## 4    4 -1.76  0.84
## _74 74 -1.65  0.30
## 55  55 -1.59  1.25
## 52  52 -1.58 -0.24
## 15  15 -1.43 -0.60
## 98  98 -1.34  0.41
## 63  63 -1.33  0.26
## 19  19 -1.13  0.70
## 41  41 -1.06 -0.56
## _56 56 -1.03 -1.07
## 94  94 -0.98 -0.02
## 95  95 -0.97  0.08
head(get1stOfRepeatedByCol(mat3,sortB="B",sortS="B"))
##   A     B     C
## 1 1  1.43  0.02
## 2 2 -0.08  1.38
## 3 3  0.74 -0.07
## 4 4 -1.76  0.84
## 5 5 -0.07 -0.97
## 6 6  0.45 -1.97

Combine replicates from list to matrix

lst2 <- list(aa_1x=matrix(1:12,nrow=4,byrow=TRUE),ab_2x=matrix(24:13,nrow=4,byrow=TRUE))
combineReplFromListToMatr(lst2)
## $a
##        1
##  [1,]  1
##  [2,]  4
##  [3,]  7
##  [4,] 10
##  [5,]  2
##  [6,]  5
##  [7,]  8
##  [8,] 11
##  [9,]  3
## [10,]  6
## [11,]  9
## [12,] 12
## 
## $b
##        2
##  [1,] 24
##  [2,] 21
##  [3,] 18
##  [4,] 15
##  [5,] 23
##  [6,] 20
##  [7,] 17
##  [8,] 14
##  [9,] 22
## [10,] 19
## [11,] 16
## [12,] 13

Non-redundant lines of matrix

mat4 <- matrix(rep(c(1,1:3,3,1),2),ncol=2,dimnames=list(letters[1:6],LETTERS[1:2]))
nonRedundLines(mat4)
##   A B
## a 1 1
## c 2 2
## d 3 3
## f 1 1

Filter for unique elements /2

# input: c and dd are repeated  :
filtSizeUniq(list(A="a",B=c("b","bb","c"),D=c("dd","d","ddd","c")),filtUn=TRUE,minSi=NULL)
##  -> filtSizeUniq : 2 out of 8 peptides redundant
## $A
##   A 
## "a" 
## 
## $B
##  B.1  B.2 
##  "b" "bb" 
## 
## $D
##   D.1   D.2   D.3 
##  "dd"   "d" "ddd"
# here a,b,c and dd are repeated  :
filtSizeUniq(list(A="a",B=c("b","bb","c"),D=c("dd","d","ddd","c")),ref=c(letters[c(1:26,1:3)],
  "dd","dd","bb","ddd"),filtUn=TRUE,minSi=NULL)   
##  -> filtSizeUniq : 8 out of 8 peptides redundant
## $A
## character(0)
## 
## $B
## character(0)
## 
## $D
## character(0)

Make non-redundant matrix

t3 <- data.frame(ref=rep(11:15,3),tx=letters[1:15],
  matrix(round(runif(30,-3,2),1),nc=2),stringsAsFactors=FALSE)
  
# First we split the data.frame in list  
by(t3,t3[,1],function(x) x)
## t3[, 1]: 11
##    ref tx  X1   X2
## 1   11  a 0.4 -1.1
## 6   11  f 0.6  1.0
## 11  11  k 0.1  1.2
## ------------------------------------------------------------ 
## t3[, 1]: 12
##    ref tx   X1   X2
## 2   12  b  2.0 -0.4
## 7   12  g -0.3  1.8
## 12  12  l -1.4  0.3
## ------------------------------------------------------------ 
## t3[, 1]: 13
##    ref tx  X1   X2
## 3   13  c 0.8 -1.6
## 8   13  h 0.8 -2.4
## 13  13  m 0.9  1.8
## ------------------------------------------------------------ 
## t3[, 1]: 14
##    ref tx   X1   X2
## 4   14  d  1.7  0.7
## 9   14  i -1.6 -2.6
## 14  14  n  1.4 -1.1
## ------------------------------------------------------------ 
## t3[, 1]: 15
##    ref tx   X1   X2
## 5   15  e -0.9  0.4
## 10  15  j -1.7  0.0
## 15  15  o -1.2 -1.8
t(sapply(by(t3,t3[,1],function(x) x), summarizeCols, me="maxAbsOfRef"))
##    [,1] [,2] [,3] [,4]
## 11 11   "a"  0.1  1.2 
## 12 12   "b"  -0.3 1.8 
## 13 13   "c"  0.8  -2.4
## 14 14   "d"  -1.6 -2.6
## 15 15   "e"  -1.2 -1.8
(xt3 <- makeNRedMatr(t3, summ="mean", iniID="ref"))
##  -> makeNRedMatr : Common summarization method 'mean', run as batch
##  -> makeNRedMatr : Summarize redundant based on col 'ref'  using method(s) : 'mean', 'mean', 'mean' and 'mean' yielding 4 cols
##    ID ref tx         X1         X2 nRedLi
## 11 11  11  a  0.3666667  0.3666667      3
## 12 12  12  b  0.1000000  0.5666667      3
## 13 13  13  c  0.8333333 -0.7333333      3
## 14 14  14  d  0.5000000 -1.0000000      3
## 15 15  15  e -1.2666667 -0.4666667      3
(xt3 <- makeNRedMatr(t3, summ=unlist(list(X1="maxAbsOfRef")), iniID="ref"))
##  -> makeNRedMatr : Summarize redundant based on col 'ref'  using method(s) : 'maxAbsOfRef' and col 'X1' yielding 4 cols
##    ref tx  X1   X2   nRedLi
## 11 11  "a" 0.6  1    3     
## 12 12  "b" 2    -0.4 3     
## 13 13  "c" 0.9  1.8  3     
## 14 14  "d" 1.7  0.7  3     
## 15 15  "e" -1.7 0    3

Combine/reduce redundant lines based on specified column

matr <- matrix(c(letters[1:6],"h","h","f","e",LETTERS[1:5]),ncol=3,
  dimnames=list(letters[11:15],c("xA","xB","xC")))
combineRedBasedOnCol(matr,colN="xB")
##   xA    xB  xC   
## 2 "a,d" "f" "A,D"
## 3 "b,c" "h" "B,C"
## 1 "e"   "e" "E"
combineRedBasedOnCol(rbind(matr[1,],matr),colN="xB")
##   xA    xB  xC   
## 2 "a,d" "f" "A,D"
## 3 "b,c" "h" "B,C"
## 1 "e"   "e" "E"

Convert matrix (eg with redundant) row-names to data.frame

dat1 <- matrix(1:10,ncol=2)
rownames(dat1) <- letters[c(1:3,2,5)]
## as.data.frame(dat1)  ...  would result in an error
convMatr2df(dat1)
##     ID X1 X2
## a    a  1  6
## b_1  b  2  7
## c    c  3  8
## b_2  b  4  9
## e    e  5 10
convMatr2df(data.frame(a=as.character((1:3)/2),b=LETTERS[1:3],c=1:3))
##   ID   a b c
## 1  1 0.5 A 1
## 2  2 1.0 B 2
## 3  3 1.5 C 3
tmp <- data.frame(a=as.character((1:3)/2),b=LETTERS[1:3],c=1:3,stringsAsFactors=FALSE)
convMatr2df(tmp)
##   ID   a b c
## 1  1 0.5 A 1
## 2  2 1.0 B 2
## 3  3 1.5 C 3
tmp <- data.frame(a=as.character((1:3)/2),b=1:3,stringsAsFactors=FALSE)
convMatr2df(tmp) 
##   ID   a b
## 1  1 0.5 1
## 2  2 1.0 2
## 3  3 1.5 3

Find and combine points located very close in x/y space

set.seed(2013)
datT2 <- matrix(round(rnorm(200)+3,1),ncol=2,dimnames=list(paste("li",1:100,sep=""),
  letters[23:24]))
# (mimic) some short and longer names for each line
inf2 <- cbind(sh=paste(rep(letters[1:4],each=26),rep(letters,4),1:(26*4),sep=""),
 lo=paste(rep(LETTERS[1:4],each=26),rep(LETTERS,4),1:(26*4),",",rep(letters[sample.int(26)],4),
  rep(letters[sample.int(26)],4),sep=""))[1:100,] 
head(datT2,n=10)
##        w   x
## li1  2.9 3.7
## li2  3.8 3.3
## li3  2.3 3.3
## li4  4.4 3.1
## li5  4.5 1.8
## li6  0.4 2.4
## li7  3.7 3.3
## li8  3.3 4.0
## li9  5.0 4.1
## li10 1.6 1.1
head(combineOverlapInfo(datT2,disThr=0.03),n=10)
##        w   x       combInf clu isComb
## li1  2.9 3.7 1+16+22+47+91   1   TRUE
## li2  3.8 3.3     2+7+48+54   2   TRUE
## li3  2.3 3.3       3+66+92   3   TRUE
## li4  4.4 3.1             4  52  FALSE
## li5  4.5 1.8             5  53  FALSE
## li6  0.4 2.4             6  54  FALSE
## li7  3.7 3.3     2+7+48+54   2   TRUE
## li8  3.3 4.0         8+100   4   TRUE
## li9  5.0 4.1             9  55  FALSE
## li10 1.6 1.1            10  56  FALSE
head(combineOverlapInfo(datT2,suplI=inf2[,2],disThr=0.03),n=10)
##        w   x                 combInf clu isComb
## li1  2.9 3.7 AA1+AP16+AV22+BU47+DM91   1   TRUE
## li2  3.8 3.3       AB2+AG7+BV48+CB54   2   TRUE
## li3  2.3 3.3           AC3+CN66+DN92   3   TRUE
## li4  4.4 3.1                  AD4,ww  52  FALSE
## li5  4.5 1.8                  AE5,aj  53  FALSE
## li6  0.4 2.4                  AF6,nl  54  FALSE
## li7  3.7 3.3       AB2+AG7+BV48+CB54   2   TRUE
## li8  3.3 4.0               AH8+DV100   4   TRUE
## li9  5.0 4.1                  AI9,ic  55  FALSE
## li10 1.6 1.1                 AJ10,ee  56  FALSE

Regrouping simultaneously by two factors

nn <- rep(c("a","e","b","c","d","g","f"),c(3,1,2,2,1,2,1))
qq <- rep(c("m","n","p","o","q"),c(2,1,1,4,4))
nq <- cbind(nn,qq)[c(4,2,9,11,6,10,7,3,5,1,12,8),]
combineByEitherFactor(nq,1,2,nBy=TRUE); combineByEitherFactor(nq,1,2,nBy=FALSE)
##    nn  qq  grp nGrp
## m2 "a" "m" "1" "3" 
## q2 "f" "q" "3" "4" 
## q1 "d" "q" "3" "4" 
## o2 "b" "o" "2" "4" 
## p  "e" "p" "4" "1" 
## o4 "c" "o" "2" "4" 
## q4 "g" "q" "3" "4" 
## n  "a" "n" "1" "3" 
## m1 "a" "m" "1" "3" 
## q3 "g" "q" "3" "4" 
## o1 "b" "o" "2" "4" 
## o3 "c" "o" "2" "4"
##    nn  qq  grp
## m2 "a" "m" "1"
## q2 "f" "q" "3"
## q1 "d" "q" "3"
## o2 "b" "o" "2"
## p  "e" "p" "4"
## o4 "c" "o" "2"
## q4 "g" "q" "3"
## n  "a" "n" "1"
## m1 "a" "m" "1"
## q3 "g" "q" "3"
## o1 "b" "o" "2"
## o3 "c" "o" "2"
combineByEitherFactor(nq,1,2,conv=FALSE); combineByEitherFactor(nq,1,2,conv=TRUE)
##  -> combineByEitherFactor :  not reached convergence at 2nd pass
##    nn  qq  grp
## m2 "a" "m" "1"
## q2 "f" "q" "3"
## q1 "d" "q" "3"
## o2 "b" "o" "2"
## p  "e" "p" "4"
## o4 "c" "o" "2"
## q4 "g" "q" "3"
## n  "a" "n" "1"
## m1 "a" "m" "1"
## q3 "g" "q" "3"
## o1 "b" "o" "2"
## o3 "c" "o" "2"
##    nn  qq  grp
## m2 "a" "m" "1"
## q2 "f" "q" "3"
## q1 "d" "q" "3"
## o2 "b" "o" "2"
## p  "e" "p" "4"
## o4 "c" "o" "2"
## q4 "g" "q" "3"
## n  "a" "n" "1"
## m1 "a" "m" "1"
## q3 "g" "q" "3"
## o1 "b" "o" "2"
## o3 "c" "o" "2"
##  another example
mm <- rep(c("a","b","c","d","e"),c(3,4,2,3,1)); pp <- rep(c("m","n","o","p","q"),c(2,2,2,2,5))
combineByEitherFactor(cbind(mm,pp),1,2,con=FALSE,nBy=TRUE);
##  -> combineByEitherFactor :  not reached convergence at 2nd pass
##    mm  pp  grp nGrp
## m1 "a" "m" "1" "4" 
## m2 "a" "m" "1" "4" 
## n1 "a" "n" "1" "4" 
## n2 "b" "n" "1" "4" 
## o1 "b" "o" "2" "4" 
## o2 "b" "o" "2" "4" 
## p1 "b" "p" "2" "4" 
## p2 "c" "p" "2" "4" 
## q1 "c" "q" "3" "5" 
## q2 "d" "q" "3" "5" 
## q3 "d" "q" "3" "5" 
## q4 "d" "q" "3" "5" 
## q5 "e" "q" "3" "5"

Bin and summarize values according to their names

dat <- 11:19
names(dat) <- letters[c(6:3,2:4,8,3)]
## Here the names are not unique.
## Thus, the values can be binned by their (non-unique) names and a representative value calculated.

## Let's make a 'datUniq' with the mean of each group of values :
datUniq <- round(tapply(dat,names(dat),mean),1)
## now we propagate the mean values to the full vector 
getValuesByUnique(dat,datUniq)
##    f    e    d    c    b    c    d    h    c 
## 11.0 12.0 15.0 16.3 15.0 16.3 15.0 18.0 16.3
cbind(ini=dat,firstOfRep=getValuesByUnique(dat,datUniq),
  indexUniq=getValuesByUnique(dat,datUniq,asIn=TRUE))
##   ini firstOfRep indexUniq
## f  11       11.0         5
## e  12       12.0         4
## d  13       15.0         3
## c  14       16.3         2
## b  15       15.0         1
## c  16       16.3         2
## d  17       15.0         3
## h  18       18.0         6
## c  19       16.3         2

Import/Export

Batch-reading of csv files

path1 <- system.file("extdata",package="wrMisc")
fiNa <-  c("pl01_1.csv","pl01_2.csv","pl02_1.csv","pl02_2.csv")
datAll <- readCsvBatch(fiNa,path1)
##  -> readCsvBatch :  ready to start reading 4 files;  expecting header : TRUE
##  -> readCsvBatch :  csv-format used 'Europe'  with 96 lines & 3 cols
##  -> readCsvBatch :  ..reading file C:/Users/wraff/AppData/Local/Temp/RtmpuorxkW/Rinst286419945cce/wrMisc/extdata/pl01_1.csv
## -> readCsvBatch -> .removeEmptyCol :  columns no 1, 2 were considered empty and removed
##  -> readCsvBatch :  ..reading file C:/Users/wraff/AppData/Local/Temp/RtmpuorxkW/Rinst286419945cce/wrMisc/extdata/pl01_2.csv
##  -> readCsvBatch :  ..reading file C:/Users/wraff/AppData/Local/Temp/RtmpuorxkW/Rinst286419945cce/wrMisc/extdata/pl02_1.csv
##  -> readCsvBatch :  ..reading file C:/Users/wraff/AppData/Local/Temp/RtmpuorxkW/Rinst286419945cce/wrMisc/extdata/pl02_2.csv
str(datAll)
##  num [1:96, 1:4, 1] 158808 174272 183176 175752 49272 ...
##  - attr(*, "dimnames")=List of 3
##   ..$ : chr [1:96] "A01" "B01" "C01" "D01" ...
##   ..$ : chr [1:4] "1_1.csv" "1_2.csv" "2_1.csv" "2_2.csv"
##   ..$ : chr "StainA"
## batch reading of all csv files in specified path :
datAll2 <- readCsvBatch(fileNames=NULL,path=path1,silent=TRUE)

Normalization

The main reason for normalization is to remove variability in the data which is not directly linked to the (biological) concept of a given experiment. High-throughput data from real-world measurements may easily contain various deformations due to technical reasons, eg slight temperature variations, electromagnetic interference etc. Loading physically equal amounts of material is often also very challenging; smaller variations of this kind are typically addressed by normalization. However, applying aggressive normalization to the data also brings a considerable risk of starting to lose some of the effects one intended to study. At some point it may rather be better to eliminate a few samples or branches of an experiment to avoid too invasive an intervention. This shows that quality control can be tightly linked to decisions about data-normalization. In conclusion, normalization may be far more challenging than simply running some algorithms.

In general the user has to assume/define some hypothesis to justify intervention. Sometimes specific elements of an experiment are known not to be affected and can therefore be used to normalize the rest. Eg, if you observe the growth of trees in a forest, big blocks of rock on the ground may be assumed not to change their location. So one could use them as alignment-marks to superpose pictures taken at slightly different positions.

The hypothesis of no global changes is very common : During the course of many biological experiments (eg change of nutrient) one assumes that only a small portion of the elements measured (eg the abundance of all different gene-products) do change, since many processes of a living cell like growth, replication and interaction with neighbour-cells are assumed not to be affected. So, if one assumes that there are no global changes, one normalizes the input-data in a way that the average or median across each experiment will give the same value. In analogy, if one takes photographs on a partially cloudy day, most cameras will adjust light settings (sun or clouds) so that global luminosity stays the same. However, if too many of the measured elements are affected, this normalization approach will lead to (additional) loss of information.

I also suggest the user include graphical representation(s) of the data, helping to visualize the variability in the various samples of a given experiment and how different normalization procedures affect these outcomes.

If the input data is a data.frame containing both annotation and numeric data, the latter can be extracted using extrNumericFromMatr().
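
As a minimal sketch (assuming default arguments; output not shown) :

dfMix <- data.frame(id=letters[1:4], txt=LETTERS[1:4], va1=c(1.1,2.2,3.3,4.4), va2=5:8)
extrNumericFromMatr(dfMix)       # expect only the numeric part to be returned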

In biological high-throughput data columns typically represent different samples, which may be organized as replicates. During high-throughput experiments thousands of (independent) elements are measured (eg abundance of gene-products); these are represented by rows. Note that some experiments may produce a considerable amount of missing data (NAs) which require special attention (dedicated developments exist in other R-packages, eg in wrProteo).

My general advice is to first carefully look where such missing data is observed and to pay attention to replicate measurements where a given element once gave a real numeric value and once a missing value.

set.seed(2015); rand1 <- round(runif(300)+rnorm(300,0,2),3)
dat1 <- cbind(ser1=round(100:1+rand1[1:100]),ser2=round(1.2*(100:1+rand1[101:200])-2),
  ser3=round((100:1+rand1[201:300])^1.2-3))
dat1 <- cbind(dat1,ser4=round(dat1[,1]^seq(2,5,length.out=100)+rand1[11:110],1))
dat1[dat1 <1] <- NA
  summary(dat1)
##       ser1             ser2             ser3            ser4          
##  Min.   :  2.00   Min.   :  1.00   Min.   :  1.0   Min.   :     37.5  
##  1st Qu.: 26.75   1st Qu.: 28.00   1st Qu.: 50.0   1st Qu.:  67210.0  
##  Median : 49.50   Median : 59.00   Median :109.0   Median : 332524.0  
##  Mean   : 51.14   Mean   : 58.79   Mean   :115.1   Mean   : 542279.4  
##  3rd Qu.: 76.25   3rd Qu.: 89.50   3rd Qu.:173.5   3rd Qu.: 925759.5  
##  Max.   :100.00   Max.   :121.00   Max.   :263.0   Max.   :2123191.7  
##                                    NA's   :1
  head( .normalize(dat1,"mean",list()))
##          ser1     ser2     ser3     ser4
## [1,] 265869.7 279841.0 296551.6 2507.451
## [2,] 265869.7 270590.1 288281.3 2883.445
## [3,] 263211.0 277528.3 310729.4 3246.477
## [4,] 263211.0 277528.3 285918.3 3732.142
## [5,] 257893.6 263651.9 278829.4 4107.535
## [6,] 257893.6 256713.7 287099.8 4718.614
  dat1[c(1:5,50:54,95:100),]
##       ser1 ser2 ser3      ser4
##  [1,]  100  121  251   10000.6
##  [2,]  100  117  244   11500.2
##  [3,]   99  120  263   12948.1
##  [4,]   99  120  242   14885.1
##  [5,]   97  114  236   16382.3
##  [6,]   51   60  111  892534.1
##  [7,]   48   58  109  812490.4
##  [8,]   49   55  108  982907.4
##  [9,]   50   56  107 1188787.7
## [10,]   45   47  102  915343.6
## [11,]    3    6    5     206.4
## [12,]    5    2    7    2570.0
## [13,]    8    1    3   27125.8
## [14,]    6    4    5    6975.9
## [15,]    3    3    1     237.0
## [16,]    2    1   NA      37.5
no1 <- normalizeThis(dat1,refGrp=1:3,meth="mean")
no2 <- normalizeThis(dat1,refGrp=1:3,meth="trimMean",trim=0.4)
no3 <- normalizeThis(dat1,refGrp=1:3,meth="median")
no4 <- normalizeThis(dat1,refGrp=1:3,meth="slope",quantFa=c(0.2,0.8))
dat1[c(1:10,91:100),]
##       ser1 ser2 ser3    ser4
##  [1,]  100  121  251 10000.6
##  [2,]  100  117  244 11500.2
##  [3,]   99  120  263 12948.1
##  [4,]   99  120  242 14885.1
##  [5,]   97  114  236 16382.3
##  [6,]   97  111  243 18819.5
##  [7,]   92  113  232 19259.0
##  [8,]   96  110  219 24268.7
##  [9,]   92  111  223 25330.7
## [10,]   92  114  221 29051.1
## [11,]   11   12   17 83745.1
## [12,]   11    5    9 90054.3
## [13,]   10    8   14 61362.7
## [14,]    7    7    3 11803.2
## [15,]    3    6    5   206.4
## [16,]    5    2    7  2570.0
## [17,]    8    1    3 27125.8
## [18,]    6    4    5  6975.9
## [19,]    3    3    1   237.0
## [20,]    2    1   NA    37.5
cor(dat1[,3],rowMeans(dat1[,1:2],na.rm=TRUE),use="complete.obs")             # high
## [1] 0.9947982
cor(dat1[,4],rowMeans(dat1[,1:2],na.rm=TRUE),use="complete.obs")             # bad
## [1] -0.3473567
cor(dat1[c(1:10,91:100),4],rowMeans(dat1[c(1:10,91:100),1:2],na.rm=TRUE),use="complete.obs")
## [1] -0.1748666
cor(dat1[,3],rowMeans(dat1[,1:2],na.rm=TRUE)^ (1/seq(2,5,length.out=100)),use="complete.obs") 
## [1] 0.9421547

It is suggested to verify normalization results by plots. Note that boxplots may not be appropriate in some cases (eg multimodal distributions).
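
As a minimal sketch using base graphics and the objects dat1 and no1 created above (the other normalizations no2..no4 may be inspected the same way) :

layout(matrix(1:2, ncol=2))
boxplot(dat1, main="raw data", las=1)
boxplot(no1, main="mean-normalized", las=1)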

Filter lines of matrix for max number of NAs

dat1 <- matrix(1:56,ncol=7)
dat1[c(2,3,4,5,6,10,12,18,19,20,22,23,26,27,28,30,31,34,38,39,50,54)] <- NA
dat1; presenceFilt(dat1,gr=gl(3,3)[-(3:4)],maxGr=0)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,]    1    9   17   25   33   41   49
## [2,]   NA   NA   NA   NA   NA   42   NA
## [3,]   NA   11   NA   NA   35   43   51
## [4,]   NA   NA   NA   NA   36   44   52
## [5,]   NA   13   21   29   37   45   53
## [6,]   NA   14   NA   NA   NA   46   NA
## [7,]    7   15   NA   NA   NA   47   55
## [8,]    8   16   24   32   40   48   56
##        1-2   1-3   2-3
## [1,]  TRUE  TRUE  TRUE
## [2,] FALSE FALSE FALSE
## [3,] FALSE  TRUE  TRUE
## [4,] FALSE  TRUE  TRUE
## [5,]  TRUE  TRUE  TRUE
## [6,] FALSE FALSE FALSE
## [7,]  TRUE  TRUE FALSE
## [8,]  TRUE  TRUE  TRUE
presenceFilt(dat1,gr=gl(2,4)[-1],maxGr=1,ratM=0.1)
##  -> presenceFilt :  correcting 'maxGrpMiss' for group(s) 1 and 2  due to ratMaxNA=0.1
##        1-2
## [1,]  TRUE
## [2,] FALSE
## [3,] FALSE
## [4,] FALSE
## [5,]  TRUE
## [6,] FALSE
## [7,] FALSE
## [8,]  TRUE
presenceFilt(dat1,gr=gl(2,4)[-1],maxGr=2,rat=0.5)
##        1-2
## [1,]  TRUE
## [2,] FALSE
## [3,]  TRUE
## [4,]  TRUE
## [5,]  TRUE
## [6,]  TRUE
## [7,]  TRUE
## [8,]  TRUE
mat3 <- matrix(c(19,20,30, 18,19,28, 16,14,35),ncol=3)
cleanReplicates(mat3,nOutl=1)
##  -> cleanReplicates :  rownames of 'x' either NULL or not unique, replacing by row-numbers
##  -> cleanReplicates : removing 1 entries in lines 2
##   [,1] [,2] [,3]
## 1   19   18   16
## 2   20   19   NA
## 3   30   28   35

Statistical testing

Moderated pair-wise t-test from limma

If you are not familiar with the way data is handled in the Bioconductor package limma and you would like to use some of its tools for running moderated t-tests, the function moderTest2grp() will provide easy access :

set.seed(2017); t8 <- matrix(round(rnorm(1600,10,0.4),2),ncol=8,
  dimnames=list(paste("l",1:200),c("AA1","BB1","CC1","DD1","AA2","BB2","CC2","DD2")))
t8[3:6,1:2] <- t8[3:6,1:2]+3     # augment lines 3:6 for AA1&BB1
t8[5:8,5:6] <- t8[5:8,5:6]+3     # augment lines 5:8 for AA2&BB2 (c,d,g,h should be found)
t4 <- log2(t8[,1:4]/t8[,5:8])
fit4 <- moderTest2grp(t4,gl(2,2))
limma::topTable(fit4,coef=1,n=5)                      # effect for 3,4,7,8
##           logFC    AveExpr         t      P.Value    adj.P.Val         B
## l 7  -0.4975806 -0.2436786 -8.712092 7.989390e-17 1.597878e-14 30.668381
## l 4   0.4020373  0.1890232  7.039234 8.460981e-12 8.460981e-10 17.723883
## l 8  -0.3735170 -0.2259811 -6.539873 1.883448e-10 1.255632e-08 14.392733
## l 3   0.3508834  0.1488240  6.143585 1.950186e-09 9.750928e-08 11.923522
## l 27 -0.1348878 -0.1011609 -2.361738 1.866790e-02 6.886965e-01 -3.878176
fit4in <- moderTest2grp(t4,gl(2,2),testO="<")
##  -> moderTest2grp : testing alternative hypothesis: true difference in means is less than 0 (ie focus on 101 results with A less than B)
limma::topTable(fit4in,coef=1,n=5)
##           logFC    AveExpr         t      P.Value    adj.P.Val         B
## l 7  -0.4975806 -0.2436786 -8.712092 3.994695e-17 7.989390e-15 30.668381
## l 4   0.4020373  0.1890232  7.039234 1.000000e+00 1.000000e+00 17.723883
## l 8  -0.3735170 -0.2259811 -6.539873 9.417239e-11 9.417239e-09 14.392733
## l 3   0.3508834  0.1488240  6.143585 1.000000e+00 1.000000e+00 11.923522
## l 27 -0.1348878 -0.1011609 -2.361738 9.333949e-03 6.222633e-01 -3.878176

Multiple moderated pair-wise t-tests from limma

If you want to make multiple pair-wise comparisons, the function moderTestXgrp() can be used :

grp <- factor(rep(LETTERS[c(3,1,4)],c(2,3,3)))
set.seed(2017); t8 <- matrix(round(rnorm(208*8,10,0.4),2),ncol=8,
  dimnames=list(paste(letters[],rep(1:8,each=26),sep=""),paste(grp,c(1:2,1:3,1:3),sep="")))
t8[3:6,1:2] <- t8[3:6,1:2] +3                    # augment lines 3:6 (c-f) 
t8[5:8,c(1:2,6:8)] <- t8[5:8,c(1:2,6:8)] -1.5    # lower lines 
t8[6:7,3:5] <- t8[6:7,3:5] +2.2                  # augment lines 
## expect to find C/A in c,d,g, (h)
## expect to find C/D in c,d,e,f
## expect to find A/D in f,g,(h)  
test8 <- moderTestXgrp(t8,grp) 
head(test8$p.value,n=8)
##             A-C          A-D          C-D
## a1 8.736828e-02 6.776543e-02 9.397304e-01
## b1 4.384118e-01 5.400019e-01 8.205610e-01
## c1 1.094834e-19 6.344497e-01 2.571471e-21
## d1 2.671725e-13 9.915692e-01 2.858699e-13
## e1 1.802454e-03 2.413137e-08 9.735465e-16
## f1 3.188362e-01 2.527208e-32 2.226490e-22
## g1 1.166242e-29 6.410057e-33 5.484445e-01
## h1 1.141181e-05 1.943795e-05 5.674938e-01

Convert p-values to lfdr

Note that this example is too small for estimating really meaningful lfdr values ! For this reason the function fdrtool() from package fdrtool may issue warnings.

set.seed(2017); t8 <- matrix(round(rnorm(160,10,0.4),2),ncol=8,dimnames=list(letters[1:20],
  c("AA1","BB1","CC1","DD1","AA2","BB2","CC2","DD2")))
t8[3:6,1:2] <- t8[3:6,1:2]+3   # augment lines 3:6 (c-f) for AA1&BB1
t8[5:8,5:6] <- t8[5:8,5:6]+3   # augment lines 5:8 (e-h) for AA2&BB2 (c,d,g,h should be found)
head(pVal2lfdr(apply(t8,1,function(x) t.test(x[1:4],x[5:8])$p.value)))
## Warning in fdrtool::fdrtool(z, statistic = "pvalue", plot = FALSE, verbose
## = !silent): There may be too few input test statistics for reliable FDR
## calculations!
##         a         b         c         d         e         f 
## 1.0000000 0.5753562 0.5753562 1.0000000 1.0000000 1.0000000

Search for similar (numeric) values

va1 <- c(4:7,7,7,7,7,8:10)+(1:11)/28600
checkSimValueInSer(va1)
##  [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
cbind(va=va1,simil=checkSimValueInSer(va1))
##              va simil
##  [1,]  4.000035     0
##  [2,]  5.000070     0
##  [3,]  6.000105     0
##  [4,]  7.000140     1
##  [5,]  7.000175     1
##  [6,]  7.000210     1
##  [7,]  7.000245     1
##  [8,]  7.000280     1
##  [9,]  8.000315     0
## [10,]  9.000350     0
## [11,] 10.000385     0
aA <- c(11:17); bB <- c(12.001,13.999); cC <- c(16.2,8,9,12.5,15.9,13.5,15.7,14.1,5)
(cloMa <- findCloseMatch(aA,cC,com="diff",lim=0.5,sor=FALSE))       
## $x2
##  y4 
## 0.5 
## 
## $x3
##   y4   y6 
## -0.5  0.5 
## 
## $x4
##   y6   y8 
## -0.5  0.1 
## 
## $x6
##   y1   y5   y7 
##  0.2 -0.1 -0.3
# all matches (of 2d arg) to/within limit for each of 1st arg ('x'); 'y' ..to 2nd arg = cC
(maAa <- closeMatchMatrix(cloMa,aA,cC,lim=TRUE))  #
##      id.aA aA id.cC   cC disToPred ppmToPred nByGrp isMin nBest
## [1,]     2 12     4 12.5      -0.5  -40000.0      1     1     1
## [2,]     3 13     4 12.5       0.5   40000.0      2     1     2
## [3,]     3 13     6 13.5      -0.5  -37037.0      2     1     2
## [4,]     4 14     8 14.1      -0.1   -7092.2      2     1     1
## [5,]     6 16     5 15.9       0.1    6289.3      3     1     1
(maAa <- closeMatchMatrix(cloMa,aA,cC,lim=FALSE,origN=TRUE))  #
##      id.aA aA id.cC   cC disToPred ppmToPred nByGrp isMin nBest
## [1,]     2 12     4 12.5      -0.5  -40000.0      1     1     1
## [2,]     3 13     4 12.5       0.5   40000.0      2     1     2
## [3,]     3 13     6 13.5      -0.5  -37037.0      2     1     2
## [4,]     4 14     6 13.5       0.5   37037.0      2     0     0
## [5,]     4 14     8 14.1      -0.1   -7092.2      2     1     1
## [6,]     6 16     7 15.7       0.3   19108.0      3     0     0
## [7,]     6 16     5 15.9       0.1    6289.3      3     1     1
## [8,]     6 16     1 16.2      -0.2  -12346.0      3     0     0
(maAa <- closeMatchMatrix(cloMa,cbind(valA=81:87,aA),cbind(valC=91:99,cC),colM=2,
  colP=2,lim=FALSE))
##  -> closeMatchMatrix :  reset argument 'origNa' to FALSE since names of 'predMatr' and/or 'measMatr' result of formula and would be too long
##      id.pred valA aA id.meas valC   cC disToPred ppmToPred nByGrp isMin nBest
## [1,]       2   82 12       4   94 12.5      -0.5  -40000.0      1     1     1
## [2,]       3   83 13       4   94 12.5       0.5   40000.0      2     1     2
## [3,]       3   83 13       6   96 13.5      -0.5  -37037.0      2     1     2
## [4,]       4   84 14       6   96 13.5       0.5   37037.0      2     0     0
## [5,]       4   84 14       8   98 14.1      -0.1   -7092.2      2     1     1
## [6,]       6   86 16       7   97 15.7       0.3   19108.0      3     0     0
## [7,]       6   86 16       5   95 15.9       0.1    6289.3      3     1     1
## [8,]       6   86 16       1   91 16.2      -0.2  -12346.0      3     0     0
(maAa <- closeMatchMatrix(cloMa,cbind(aA,valA=81:87),cC,lim=FALSE,deb=TRUE))  #
## .. xxidentToMatr2a
## .. xxidentToMatr2c
##  -> closeMatchMatrix :  reset argument 'origNa' to FALSE since names of 'predMatr' and/or 'measMatr' result of formula and would be too long
## .. xxidentToMatr2d
## .. xxidentToMatr2e
## .. xxidentToMatr2f
##      id.pred aA valA id.meas measMatr disToPred ppmToPred nByGrp isMin nBest
## [1,]       2 12   82       4     12.5      -0.5  -40000.0      1     1     1
## [2,]       3 13   83       4     12.5       0.5   40000.0      2     1     2
## [3,]       3 13   83       6     13.5      -0.5  -37037.0      2     1     2
## [4,]       4 14   84       6     13.5       0.5   37037.0      2     0     0
## [5,]       4 14   84       8     14.1      -0.1   -7092.2      2     1     1
## [6,]       6 16   86       7     15.7       0.3   19108.0      3     0     0
## [7,]       6 16   86       5     15.9       0.1    6289.3      3     1     1
## [8,]       6 16   86       1     16.2      -0.2  -12346.0      3     0     0
a2 <- aA; names(a2) <- letters[1:length(a2)];  c2 <- cC; names(c2) <- letters[10+1:length(c2)]
(cloM2 <- findCloseMatch(x=a2,y=c2,com="diff",lim=0.5,sor=FALSE)) 
## $b
##   n 
## 0.5 
## 
## $c
##    n    p 
## -0.5  0.5 
## 
## $d
##    p    r 
## -0.5  0.1 
## 
## $f
##    k    o    q 
##  0.2 -0.1 -0.3
(maA2 <- closeMatchMatrix(cloM2,predM=cbind(valA=81:87,a2),measM=cbind(valC=91:99,c2),
  colM=2,colP=2,lim=FALSE,asData=TRUE)) 
##  -> closeMatchMatrix :  reset argument 'origNa' to FALSE since names of 'predMatr' and/or 'measMatr' result of formula and would be too long
##     id.pred valA a2 id.meas valC   c2 disToPred ppmToPred nByGrp isMin nBest
## b         b   82 12       n   94 12.5      -0.5  -40000.0      1     1     1
## c_1       c   83 13       n   94 12.5       0.5   40000.0      2     1     2
## c_2       c   83 13       p   96 13.5      -0.5  -37037.0      2     1     2
## d_1       d   84 14       p   96 13.5       0.5   37037.0      2     0     0
## d_2       d   84 14       r   98 14.1      -0.1   -7092.2      2     1     1
## f_1       f   86 16       q   97 15.7       0.3   19108.0      3     0     0
## f_2       f   86 16       o   95 15.9       0.1    6289.3      3     1     1
## f_3       f   86 16       k   91 16.2      -0.2  -12346.0      3     0     0
(maA2 <- closeMatchMatrix(cloM2,cbind(id=names(a2),valA=81:87,a2),cbind(id=names(c2),
  valC=91:99,c2),colM=3,colP=3,lim=FALSE,deb=FALSE)) 
##  -> closeMatchMatrix :  reset argument 'origNa' to FALSE since names of 'predMatr' and/or 'measMatr' result of formula and would be too long
##   id.pred valA a2   id.meas valC c2     disToPred             ppmToPred nByGrp
## b "b"     "82" "12" "n"     "94" "12.5" "-0.5"                "-40000"  "1"   
## c "c"     "83" "13" "n"     "94" "12.5" "0.5"                 "40000"   "2"   
## c "c"     "83" "13" "p"     "96" "13.5" "-0.5"                "-37037"  "2"   
## d "d"     "84" "14" "p"     "96" "13.5" "0.5"                 "37037"   "2"   
## d "d"     "84" "14" "r"     "98" "14.1" "-0.0999999999999996" "-7092.2" "2"   
## f "f"     "86" "16" "q"     "97" "15.7" "0.300000000000001"   "19108"   "3"   
## f "f"     "86" "16" "o"     "95" "15.9" "0.0999999999999996"  "6289.3"  "3"   
## f "f"     "86" "16" "k"     "91" "16.2" "-0.199999999999999"  "-12346"  "3"   
##   isMin nBest
## b "1"   "1"  
## c "1"   "2"  
## c "1"   "2"  
## d "0"   "0"  
## d "1"   "1"  
## f "0"   "0"  
## f "1"   "1"  
## f "0"   "0"

Find similar numeric values from two vectors/matrixes

aA <- c(11:17); bB <- c(12.001,13.999); cC <- c(16.2,8,9,12.5,12.6,15.9,14.1)
aZ <-  matrix(c(aA,aA+20),ncol=2,dimnames=list(letters[1:length(aA)],c("aaA","aZ")))
cZ <-  matrix(c(cC,cC+20),ncol=2,dimnames=list(letters[1:length(cC)],c("ccC","cZ")))
findCloseMatch(cC,aA,com="diff",lim=0.5,sor=FALSE)
## $x1
##   y6 
## -0.2 
## 
## $x4
##   y2   y3 
## -0.5  0.5 
## 
## $x5
##  y3 
## 0.4 
## 
## $x6
##  y6 
## 0.1 
## 
## $x7
##   y4 
## -0.1
findSimilFrom2sets(aA,cC)
##      aA predMatr[, ] cC measMatr disToPred ppmToPred nByGrp isMin nBest
## [1,]  2           12  4     12.5      -0.5  -40000.0      1     1     1
## [2,]  3           13  4     12.5       0.5   40000.0      2     0     0
## [3,]  3           13  5     12.6       0.4   31746.0      2     1     1
## [4,]  4           14  7     14.1      -0.1   -7092.2      1     1     1
## [5,]  6           16  6     15.9       0.1    6289.3      2     1     1
## [6,]  6           16  1     16.2      -0.2  -12346.0      2     0     0
findSimilFrom2sets(cC,aA)
##      cC predMatr[, ] aA measMatr disToPred ppmToPred nByGrp isMin nBest
## [1,]  1         16.2  6       16       0.2   12500.0      1     1     1
## [2,]  4         12.5  2       12       0.5   41667.0      2     1     2
## [3,]  4         12.5  3       13      -0.5  -38462.0      2     1     2
## [4,]  5         12.6  3       13      -0.4  -30769.0      1     1     1
## [5,]  6         15.9  6       16      -0.1   -6250.0      1     1     1
## [6,]  7         14.1  4       14       0.1    7142.9      1     1     1
findSimilFrom2sets(aA,cC,best=FALSE)
##      aA predMatr[, ] cC measMatr disToPred ppmToPred nByGrp isMin nBest
## [1,]  2           12  4     12.5      -0.5  -40000.0      1     1     1
## [2,]  3           13  4     12.5       0.5   40000.0      2     0     0
## [3,]  3           13  5     12.6       0.4   31746.0      2     1     1
## [4,]  4           14  7     14.1      -0.1   -7092.2      1     1     1
## [5,]  6           16  6     15.9       0.1    6289.3      2     1     1
## [6,]  6           16  1     16.2      -0.2  -12346.0      2     0     0
findSimilFrom2sets(aA,cC,comp="ppm",lim=5e4,deb=TRUE)
## xxfindSimilFrom2sets2
## .. xxidentToMatr2a
## .. xxidentToMatr2c
## .. xxidentToMatr2d
## .. xxidentToMatr2e
## .. xxidentToMatr2f
##      aA predMatr[, ] cC measMatr disToPred ppmToPred nByGrp isMin nBest
## [1,]  2           12  4     12.5      -0.5  -40000.0      1     1     1
## [2,]  3           13  4     12.5       0.5   40000.0      2     0     0
## [3,]  3           13  5     12.6       0.4   31746.0      2     1     1
## [4,]  4           14  7     14.1      -0.1   -7092.2      1     1     1
## [5,]  6           16  6     15.9       0.1    6289.3      2     1     1
## [6,]  6           16  1     16.2      -0.2  -12346.0      2     0     0
## [7,]  7           17  1     16.2       0.8   49383.0      1     1     1
findSimilFrom2sets(aA,cC,comp="ppm",lim=9e4,bestO=FALSE)
##       aA predMatr[, ] cC measMatr disToPred ppmToPred nByGrp isMin nBest
##  [1,]  2           12  4     12.5      -0.5  -40000.0      2     1     1
##  [2,]  2           12  5     12.6      -0.6  -47619.0      3     0     0
##  [3,]  3           13  4     12.5       0.5   40000.0      2     0     0
##  [4,]  3           13  5     12.6       0.4   31746.0      3     1     1
##  [5,]  3           13  7     14.1      -1.1  -78014.0      3     0     0
##  [6,]  4           14  7     14.1      -0.1   -7092.2      1     1     1
##  [7,]  5           15  7     14.1       0.9   63830.0      3     1     2
##  [8,]  5           15  6     15.9      -0.9  -56604.0      3     1     2
##  [9,]  5           15  1     16.2      -1.2  -74074.0      2     0     0
## [10,]  6           16  6     15.9       0.1    6289.3      3     0     0
## [11,]  6           16  1     16.2      -0.2  -12346.0      2     0     0
## [12,]  7           17  6     15.9       1.1   69182.0      2     1     0
## [13,]  7           17  1     16.2       0.8   49383.0      2     1     2
# Below, fewer 'best matches' get found since the search window is larger (ie more good hits compete)
findSimilFrom2sets(aA,cC,comp="ppm",lim=9e4,bestO=TRUE)      
##      aA predMatr[, ] cC measMatr disToPred ppmToPred nByGrp isMin nBest
## [1,]  2           12  4     12.5      -0.5  -40000.0      2     1     1
## [2,]  3           13  5     12.6       0.4   31746.0      3     1     1
## [3,]  4           14  7     14.1      -0.1   -7092.2      1     1     1
## [4,]  5           15  7     14.1       0.9   63830.0      3     1     2
## [5,]  5           15  6     15.9      -0.9  -56604.0      3     1     2
## [6,]  7           17  6     15.9       1.1   69182.0      2     1     0
## [7,]  7           17  1     16.2       0.8   49383.0      2     1     2
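
In these tables disToPred gives the raw difference between the paired values, while ppmToPred appears to express this difference in parts per million relative to the matched value of the second input; isMin flags the closest match within a group of competing hits and nBest seems to count equally good best hits. The ppm figure can be cross-checked in base R (an observation from the output above, not a description of the internal code):

(16.2 - 16)/16 * 1e6         # 12500, as in the corresponding line of findSimilFrom2sets(cC,aA)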

Eliminate close (overlapping) points (in x & y space)

da1 <- matrix(c(rep(0:4,5),0.01,1.1,2.04,3.07,4.5),ncol=2); da1[,1] <- da1[,1]*99; head(da1)
##      [,1] [,2]
## [1,]    0    0
## [2,]   99    1
## [3,]  198    2
## [4,]  297    3
## [5,]  396    4
## [6,]    0    0
elimCloseCoord(da1)
##  -> elimCloseCoord :  reducing 'x' from 15 to 7 lines
##    [,1] [,2]
## 1     0  0.0
## 2    99  1.0
## 3   198  2.0
## 4   297  3.0
## 5   396  4.0
## 12   99  1.1
## 15  396  4.5
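
The underlying idea can be sketched in base R: a point is considered redundant if both of its coordinates lie within a small fraction of the respective coordinate range of a point already retained. Note, this is only a minimal illustration of the concept (with an arbitrary tolerance), not the actual algorithm used by elimCloseCoord():

closeDup <- function(xy, frac=0.02) {            # frac : tolerance, as fraction of each coordinate range
  tol <- frac * apply(xy, 2, function(z) diff(range(z)))
  keep <- rep(TRUE, nrow(xy))
  for(i in seq_len(nrow(xy))[-1]) {
    prev <- which(keep[seq_len(i - 1)])          # points retained so far
    near <- abs(xy[prev,1] - xy[i,1]) < tol[1] & abs(xy[prev,2] - xy[i,2]) < tol[2]
    if(any(near)) keep[i] <- FALSE               # drop points overlapping a retained one
  }
  xy[keep,]
}
closeDup(da1)        # retains the same 7 points as elimCloseCoord(da1) above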

Find close numeric values between two vectors

aa <- 11:14 ; bb <- c(13.1,11.5,14.3,20:21)
findCloseMatch(aa,bb,com="diff",lim=0.6)
## $x1
##  y2 
## 0.5 
## 
## $x2
##   y2 
## -0.5 
## 
## $x3
##  y1 
## 0.1 
## 
## $x4
##  y3 
## 0.3
findCloseMatch(c(a=5,b=11,c=12,d=18),c(G=2,H=11,I=12,J=13)+0.5,comp="diff",lim=2)
## $b
##   H   I 
## 0.5 1.5 
## 
## $c
##    H    I    J 
## -0.5  0.5  1.5
findCloseMatch(c(4,5,11,12,18),c(2,11,12,13,33)+0.5,comp="diff",lim=2)
## $x1
##   y1 
## -1.5 
## 
## $x3
##  y2  y3 
## 0.5 1.5 
## 
## $x4
##   y2   y3   y4 
## -0.5  0.5  1.5
findCloseMatch(c(4,5,11,12,18),c(2,11,12,13,33)+0.5,comp="diff",lim=2,sort=FALSE)
## $x1
##   y1 
## -1.5 
## 
## $x3
##  y2  y3 
## 0.5 1.5 
## 
## $x4
##   y2   y3   y4 
## -0.5  0.5  1.5
.compareByDiff(list(c(a=10,b=11,c=12,d=13),c(H=11,I=12,J=13,K=33)+0.5),limit=1)    # return matrix
##       a     b     c     d
## H FALSE  TRUE  TRUE FALSE
## I FALSE FALSE  TRUE  TRUE
## J FALSE FALSE FALSE  TRUE
## K FALSE FALSE FALSE FALSE
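
Conceptually this amounts to an all-against-all comparison of differences, which can be reproduced in base R (for this toy data it yields the same logical matrix; whether the limit is treated as inclusive is not tested here):

x0 <- c(a=10,b=11,c=12,d=13)
y0 <- c(H=11,I=12,J=13,K=33) + 0.5
abs(outer(y0, x0, "-")) < 1          # all-against-all absolute differences below the limit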
a2 <- c(11:20); names(a2) <- letters[11:20]
b2 <- c(25:5)+c(rep(0,5),(1:10)/50000,rep(0,6)); names(b2) <- LETTERS[25:5]
# In this toy example, entries carrying the same letter as name (lower-case in a2, upper-case in b2) have approximately the same value:
which(abs(b2-a2[8]) < a2[8]*1e-6*5)                                           # find R=8 
## R 
## 8
# We can also express the (relative) distance as ppm
findCloseMatch(a2,b2,com="ppm",lim=5)                                         # find Q,R,S,T
## $q
##         Q 
## -4.705882 
## 
## $r
##         R 
## -3.333333 
## 
## $s
##         S 
## -2.105263 
## 
## $t
##  T 
## -1
# Of course, other comparison modes (here ratios) and custom limits may be used:
findCloseMatch(a2,b2,com="ratio",lim=1.000005)                                # find Q,R,S,T
## $q
##            Q 
## 6.789137e-06 
## 
## $r
##            R 
## 4.808975e-06 
## 
## $s
##           S 
## 3.03725e-06 
## 
## $t
##            T 
## 1.442694e-06
findCloseMatch(a2,b2,com="diff", lim=0.00005)                                 # find S,T
## $s
##     S 
## 4e-05 
## 
## $t
##     T 
## 2e-05
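
Judging from the output above, the ppm distance corresponds to the difference relative to the value of the first vector, multiplied by 10^6 (again an observation from this example, not a description of the internal code):

(a2["q"] - b2["Q"]) / a2["q"] * 1e6        # ~ -4.705882, as reported for $q above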

Tree-like structures

Characterize the individual contribution of single edges in tree structures

path1 <- matrix(c(17,19,18,17, 4,4,2,3),ncol=2,
  dimnames=list(c("A/B/C/D","A/B/G/D","A/H","A/H/I"),c("sumLen","n")))
contribToContigPerFrag(path1)
##   sumLe n.frag len.rat
## A    19      4   1.000
## B    19      4   1.000
## C    17      4   0.895
## D    19      4   1.000
## G    19      4   1.000
## H    18      2   0.947
## I    17      3   0.895
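
Here len.rat appears to be sumLe expressed relative to the longest path through the tree (19 in this example):

round(c(17,18,19)/19, 3)         # 0.895 0.947 1.000, matching len.rat above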

Count same start- and end-sites of edges (or fragments)

frag1 <- cbind(beg=c(2,3,7,13,13,15,7,9,7, 3,3,5), end=c(6,12,8,18,20,20,19,12,12, 4,5,7))
rownames(frag1) <- letters[1:nrow(frag1)]
countSameStartEnd(frag1)
##   beg end beg.n beg.rat end.n end.rat
## a   2   6    NA      NA    NA      NA
## b   3  12     3  0.2500     3  0.2500
## c   7   8     3  0.2500    NA      NA
## d  13  18     2  0.1667    NA      NA
## e  13  20     2  0.1667     2  0.1667
## f  15  20    NA      NA     2  0.1667
## g   7  19     3  0.2500    NA      NA
## h   9  12    NA      NA     3  0.2500
## i   7  12     3  0.2500     3  0.2500
## j   3   4     3  0.2500    NA      NA
## k   3   5     3  0.2500    NA      NA
## l   5   7    NA      NA    NA      NA
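
The columns beg.n and end.n simply count how often a given start- or end-site occurs among all fragments (NA for unique sites), and beg.rat/end.rat appear to be these counts divided by the total number of fragments:

table(frag1[,"beg"])         # start-sites 3 and 7 occur 3 times, 13 occurs twice
3/nrow(frag1)                # 0.25, as in beg.rat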
simpleFragFig(frag1)

Fuse previously identified pairs to ‘clusters’

(daPa <- matrix(c(1:5,8,2:6,9),ncol=2))
##      [,1] [,2]
## [1,]    1    2
## [2,]    2    3
## [3,]    3    4
## [4,]    4    5
## [5,]    5    6
## [6,]    8    9
fusePairs(daPa,maxFuse=4)
## 1 2 3 4 4 5 6 8 9 
## 1 1 1 1 2 2 2 3 3
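
The chain of pairs 1-2, 2-3, … 5-6 gets split into two clusters because maxFuse=4 appears to cap the number of elements per cluster (note that element 4, which bridges the two, is listed in both). With a sufficiently high maxFuse the whole chain should end up in a single cluster (a guess based on the output above; output not shown):

fusePairs(daPa, maxFuse=10)       # presumably fuses 1..6 into one cluster and 8..9 into another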

Support for graphical output

Convenient paste-collapse

This function pastes and collapses the elements of a vector into a single character string; it allows adding quotes and separating the last element by specific text (eg ‘and’).

pasteC(1:4)
## [1] "1, 2, 3 and 4"

Transform numeric values to a color-gradient

By default most color-gradients (eg rainbow-like ones) end with a color very close to the one they start with, which makes extreme values hard to distinguish; omitting a few colors at the start and/or end of the gradient avoids this.

set.seed(2015); dat1 <- round(runif(15),2)
plot(1:15,dat1,pch=16,cex=2,col=colorAccording2(dat1),main="Color gradient according to value in y")

# Here we modify the span of the color gradient
plot(1:15,dat1,pch=16,cex=2,col=colorAccording2(dat1,nStartO=0,nEndO=4,revCol=TRUE),main="blue to red")

# It is also possible to work with scales of transparency
plot(1:9,pch=3)
points(1:9,1:9,col=transpGraySca(st=0,en=0.8,nSt=9,trans=0.3),cex=42,pch=16)

Assign new transparency to given colors

col0 <- c("#998FCC","#5AC3BA","#CBD34E","#FF7D73")
col1 <- convColorToTransp(col0,alph=0.7)
layout(1:2)
pie(rep(1,length(col0)),col=col0)
pie(rep(1,length(col1)),col=col1,main="new transparency")
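
The added transparency can be verified in base R by inspecting the alpha channel of the returned colors (0.7 on a 0-255 scale corresponds to about 178):

col2rgb(col1, alpha=TRUE)["alpha",]      # ~178 for each color after alph=0.7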

Session-Info

## R version 3.6.3 (2020-02-29)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=C                   LC_CTYPE=French_France.1252   
## [3] LC_MONETARY=French_France.1252 LC_NUMERIC=C                  
## [5] LC_TIME=French_France.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] wrMisc_1.2.0
## 
## loaded via a namespace (and not attached):
##  [1] compiler_3.6.3  magrittr_1.5    limma_3.42.2    htmltools_0.4.0
##  [5] tools_3.6.3     yaml_2.2.1      Rcpp_1.0.4      fdrtool_1.2.15 
##  [9] stringi_1.4.6   rmarkdown_2.1   knitr_1.28      stringr_1.4.0  
## [13] digest_0.6.25   xfun_0.12       rlang_0.4.5     evaluate_0.14