Basic use of dfunctions for data-frames in mets
Table of Contents
Simple data manipulation for data-frames
- Renaming variables, Deleting variables
- Looking at the data
- Making new variables for the analysis
- Making factors (groupings)
- Working with factors
- Making a factor from existing numeric variable and vice versa
Here are some key data-manipulation moves on a data-frame, which is how we typically organize our data in R. After the data has been read into R it will typically be a data-frame; if not, we can force it to be one with as.data.frame. The basic idea of the utility functions is to give a simple, easy-to-type way of doing basic data manipulation on a data-frame, much like what is possible in SAS or Stata.
The functions dcut, dfactor and so on do essentially what the base R functions cut, factor and so on do, but they are easier to use in the context of data-frames and have additional functionality.
library(mets)
data(melanoma)
is.data.frame(melanoma)
melanoma=as.data.frame(melanoma)
Loading required package: timereg
Loading required package: survival
Loading required package: lava
lava version 1.4.7.1
mets version 1.2.1

Attaching package: ‘mets’

The following object is masked _by_ ‘.GlobalEnv’:

    object.defined

[1] TRUE
Here we work on the melanoma data that is already read into R and is a data-frame.
dUtility functions
The structure for all functions is
- dfunction(dataframe,y~x|ifcond,…)
which applies the function to y in the data-frame, grouped by x, using only the rows where the condition ifcond holds. The basic functions are
…
A generic function daggregate, daggr, can be called with a function as the argument
- daggregate(dataframe,y~x|ifcond,fun=function,…)
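To make the y~x|ifcond convention concrete, here is a minimal base R sketch of what such a call means: restrict the rows, then apply a function to y within groups of x. It does not use mets and is not how mets implements it; the toy data-frame mel is simply the first six rows of melanoma as printed by dhead further down.

```r
# Toy data: first six rows of melanoma, copied from the dhead() output
mel <- data.frame(
  no     = c(789, 13, 97, 16, 21, 469),
  status = c(3, 3, 2, 3, 1, 1),
  days   = c(10, 30, 35, 99, 185, 204),
  ulc    = c(1, 0, 0, 0, 1, 1),
  thick  = c(676, 65, 134, 290, 1208, 484),
  sex    = c(1, 1, 1, 0, 1, 1)
)

# Base R reading of "thick ~ ulc | sex == 1" with fun = mean:
# keep rows with sex == 1, then average thick within each level of ulc
res <- aggregate(thick ~ ulc, data = subset(mel, sex == 1), FUN = mean)
print(res)
```

The formula carries the variables and the grouping, and the part after the bar plays the role of subset.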
Renaming, deleting, keeping, dropping variables
melanoma=drename(melanoma,tykkelse~thick)
names(melanoma)
[1] "no" "status" "days" "ulc" "tykkelse" "sex"
Deleting variables
data(melanoma)
melanoma=drm(melanoma,~thick+sex)
names(melanoma)
[1] "no" "status" "days" "ulc"
or, SAS style,
data(melanoma)
melanoma=ddrop(melanoma,~thick+sex)
names(melanoma)
[1] "no" "status" "days" "ulc"
Alternatively, we can keep only certain variables
data(melanoma)
melanoma=dkeep(melanoma,~thick+sex+status+days)
names(melanoma)
[1] "thick" "sex" "status" "days"
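For comparison, the same three operations can be written out in base R. This is only a sketch of equivalent results on a toy data-frame (the first six rows of melanoma, copied from the dhead output shown later); the object names renamed, dropped and kept are made up for the illustration.

```r
# Toy data: first six rows of melanoma, copied from the dhead() output
mel <- data.frame(
  no     = c(789, 13, 97, 16, 21, 469),
  status = c(3, 3, 2, 3, 1, 1),
  days   = c(10, 30, 35, 99, 185, 204),
  ulc    = c(1, 0, 0, 0, 1, 1),
  thick  = c(676, 65, 134, 290, 1208, 484),
  sex    = c(1, 1, 1, 0, 1, 1)
)

# Rename thick to tykkelse
renamed <- mel
names(renamed)[names(renamed) == "thick"] <- "tykkelse"

# Drop thick and sex
dropped <- mel[, setdiff(names(mel), c("thick", "sex")), drop = FALSE]

# Keep only the listed variables, in the listed order
kept <- mel[, c("thick", "sex", "status", "days"), drop = FALSE]

names(renamed)
names(dropped)
names(kept)
```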
Looking at the data
data(melanoma)
dstr(melanoma)
'data.frame':	205 obs. of  6 variables:
 $ no    : int  789 13 97 16 21 469 685 7 932 944 ...
 $ status: int  3 3 2 3 1 1 1 1 3 1 ...
 $ days  : int  10 30 35 99 185 204 210 232 232 279 ...
 $ ulc   : int  1 0 0 0 1 1 1 1 1 1 ...
 $ thick : int  676 65 134 290 1208 484 516 1288 322 741 ...
 $ sex   : int  1 1 1 0 1 1 1 1 0 0 ...
In RStudio the data can be viewed as a data-table, but parts of the data can also be listed in the output window with dlist
dlist(melanoma)
     no status days ulc thick sex
1   789      3   10   1   676   1
2    13      3   30   0    65   1
3    97      2   35   0   134   1
4    16      3   99   0   290   0
5    21      1  185   1  1208   1
---
201 317      2 4492   1   706   1
202 798      2 4668   0   612   0
203 806      2 4688   0    48   0
204 606      2 4926   0   226   0
205 328      2 5565   0   290   0
dlist(melanoma, ~.|sex==1)
     no status days ulc thick
1   789      3   10   1   676
2    13      3   30   0    65
3    97      2   35   0   134
5    21      1  185   1  1208
6   469      1  204   1   484
---
191 445      2 3909   1   806
195 415      2 4119   0    65
197 175      2 4207   0    65
198 493      2 4310   0   210
201 317      2 4492   1   706
dlist(melanoma, ~ulc+days+thick+sex|sex==1)
    ulc days thick sex
1     1   10   676   1
2     0   30    65   1
3     0   35   134   1
5     1  185  1208   1
6     1  204   484   1
---
191   1 3909   806   1
195   0 4119    65   1
197   0 4207    65   1
198   0 4310   210   1
201   1 4492   706   1
Getting summaries
dsummary(melanoma)
       no            status          days           ulc            thick
 Min.   :  2.0   Min.   :1.00   Min.   :  10   Min.   :0.000   Min.   :  10
 1st Qu.:222.0   1st Qu.:1.00   1st Qu.:1525   1st Qu.:0.000   1st Qu.:  97
 Median :469.0   Median :2.00   Median :2005   Median :0.000   Median : 194
 Mean   :463.9   Mean   :1.79   Mean   :2153   Mean   :0.439   Mean   : 292
 3rd Qu.:731.0   3rd Qu.:2.00   3rd Qu.:3042   3rd Qu.:1.000   3rd Qu.: 356
 Max.   :992.0   Max.   :3.00   Max.   :5565   Max.   :1.000   Max.   :1742
      sex
 Min.   :0.0000
 1st Qu.:0.0000
 Median :0.0000
 Mean   :0.3854
 3rd Qu.:1.0000
 Max.   :1.0000
or for specific variables
dsummary(melanoma,~thick+status+sex)
     thick          status          sex
 Min.   :  10   Min.   :1.00   Min.   :0.0000
 1st Qu.:  97   1st Qu.:1.00   1st Qu.:0.0000
 Median : 194   Median :2.00   Median :0.0000
 Mean   : 292   Mean   :1.79   Mean   :0.3854
 3rd Qu.: 356   3rd Qu.:2.00   3rd Qu.:1.0000
 Max.   :1742   Max.   :3.00   Max.   :1.0000
Summaries in different groups (sex)
dsummary(melanoma,thick+days+status~sex)
sex: 0
     thick             days          status
 Min.   :  10.0   Min.   :  99   Min.   :1.000
 1st Qu.:  97.0   1st Qu.:1636   1st Qu.:2.000
 Median : 162.0   Median :2059   Median :2.000
 Mean   : 248.6   Mean   :2283   Mean   :1.833
 3rd Qu.: 306.0   3rd Qu.:3131   3rd Qu.:2.000
 Max.   :1742.0   Max.   :5565   Max.   :3.000
------------------------------------------------------------
sex: 1
     thick             days          status
 Min.   :  16.0   Min.   :  10   Min.   :1.000
 1st Qu.: 105.0   1st Qu.:1052   1st Qu.:1.000
 Median : 258.0   Median :1860   Median :2.000
 Mean   : 361.1   Mean   :1946   Mean   :1.722
 3rd Qu.: 484.0   3rd Qu.:2784   3rd Qu.:2.000
 Max.   :1466.0   Max.   :4492   Max.   :3.000
and only among those with thin tumours (thick<97), or only among males (sex==1)
dsummary(melanoma,thick+days+status~sex|thick<97)
sex: 0
     thick            days          status
 Min.   :10.00   Min.   : 355   Min.   :1.000
 1st Qu.:32.00   1st Qu.:1762   1st Qu.:2.000
 Median :64.00   Median :2227   Median :2.000
 Mean   :51.48   Mean   :2425   Mean   :2.034
 3rd Qu.:65.00   3rd Qu.:3185   3rd Qu.:2.000
 Max.   :81.00   Max.   :4688   Max.   :3.000
------------------------------------------------------------
sex: 1
     thick            days          status
 Min.   :16.00   Min.   :  30   Min.   :1.000
 1st Qu.:30.00   1st Qu.:1820   1st Qu.:2.000
 Median :65.00   Median :2886   Median :2.000
 Mean   :55.75   Mean   :2632   Mean   :1.875
 3rd Qu.:81.00   3rd Qu.:3328   3rd Qu.:2.000
 Max.   :81.00   Max.   :4207   Max.   :3.000
dsummary(melanoma,thick+status~+1|sex==1)
     thick            status
 Min.   :  16.0   Min.   :1.000
 1st Qu.: 105.0   1st Qu.:1.000
 Median : 258.0   Median :2.000
 Mean   : 361.1   Mean   :1.722
 3rd Qu.: 484.0   3rd Qu.:2.000
 Max.   :1466.0   Max.   :3.000
Tables between variables
dtable(melanoma,~status+sex)
       sex  0  1
status
1          28 29
2          91 43
3           7  7
All bivariate tables
dtable(melanoma,~status+sex+ulc,level=2)
    status  1  2  3
sex
0          28 91  7
1          29 43  7

    status  1  2  3
ulc
0          16 92  7
1          41 42  7

    sex  0  1
ulc
0       79 36
1       47 43
All univariate tables
dtable(melanoma,~status+sex+ulc,level=1)
status
  1   2   3
 57 134  14

sex
  0   1
126  79

ulc
  0   1
115  90
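The same kinds of tables can be produced in base R with table. The sketch below is an assumed analogue for comparison, not a description of how dtable works internally; it uses a toy data-frame made of the first six rows of melanoma (copied from the dhead output shown later), so the counts differ from the full-data tables above.

```r
# Toy data: first six rows of melanoma, copied from the dhead() output
mel <- data.frame(
  status = c(3, 3, 2, 3, 1, 1),
  ulc    = c(1, 0, 0, 0, 1, 1),
  sex    = c(1, 1, 1, 0, 1, 1)
)

# Bivariate table of status by sex
tab <- table(status = mel$status, sex = mel$sex)
print(tab)

# All univariate tables, one per variable
uni <- lapply(mel[c("status", "sex", "ulc")], table)
print(uni)
```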
Making new variables for the analysis
To define a bunch of new covariates within a data-frame
melanoma=transform(melanoma, thick2=thick^2, lthick=log(thick))
dhead(melanoma)
   no status days ulc thick sex  thick2   lthick
1 789      3   10   1   676   1  456976 6.516193
2  13      3   30   0    65   1    4225 4.174387
3  97      2   35   0   134   1   17956 4.897840
4  16      3   99   0   290   0   84100 5.669881
5  21      1  185   1  1208   1 1459264 7.096721
6 469      1  204   1   484   1  234256 6.182085
When a definition should only apply to rows satisfying a condition, use the dtransform function, which extends transform with an optional condition
melanoma=dtransform(melanoma,ll=thick*1.05^ulc,sex==1)
melanoma=dtransform(melanoma,ll=thick,sex!=1)
dsummary(melanoma,ll~sex+ulc)
sex: 0
ulc: 0
       ll
 Min.   :  10.0
 1st Qu.:  65.0
 Median : 129.0
 Mean   : 173.7
 3rd Qu.: 194.0
 Max.   :1288.0
------------------------------------------------------------
sex: 1
ulc: 0
       ll
 Min.   :  16.0
 1st Qu.:  65.0
 Median :  97.0
 Mean   : 197.4
 3rd Qu.: 198.0
 Max.   :1466.0
------------------------------------------------------------
sex: 0
ulc: 1
       ll
 Min.   :  16.0
 1st Qu.: 177.0
 Median : 258.0
 Mean   : 374.6
 3rd Qu.: 403.0
 Max.   :1742.0
------------------------------------------------------------
sex: 1
ulc: 1
       ll
 Min.   :  85.05
 1st Qu.: 338.10
 Median : 506.10
 Mean   : 523.12
 3rd Qu.: 659.40
 Max.   :1352.40
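The two conditional assignments can be mirrored in base R, which may help in reading the result. The sketch below shows two equivalent ways on a toy data-frame (the first six rows of melanoma, copied from the dhead output above); the column name ll2 is made up for the comparison.

```r
# Toy data: first six rows of melanoma, copied from the dhead() output
mel <- data.frame(
  ulc   = c(1, 0, 0, 0, 1, 1),
  thick = c(676, 65, 134, 290, 1208, 484),
  sex   = c(1, 1, 1, 0, 1, 1)
)

# Variant 1: a single vectorised ifelse
mel$ll <- ifelse(mel$sex == 1, mel$thick * 1.05^mel$ulc, mel$thick)

# Variant 2: start from thick, then overwrite only the rows with sex == 1
mel$ll2 <- mel$thick
idx <- mel$sex == 1
mel$ll2[idx] <- mel$thick[idx] * 1.05^mel$ulc[idx]

print(mel)
```

Only rows with sex==1 and ulc==1 are scaled by 1.05; all other rows keep the value of thick.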
Making factors (groupings)
In the melanoma data the variable thick gives the thickness of the melanoma tumour. For some analyses we would like to make a factor depending on the thickness. This can be done in several different ways
melanoma=dcut(melanoma,~thick,breaks=c(0,200,500,800,2000))
New variable is named thickcat.0 by default.
To see levels of factors in data-frame
dlevels(melanoma)
thickcat.0 #levels=:4
[1] "[0,200]"     "(200,500]"   "(500,800]"   "(800,2e+03]"
-----------------------------------------
Checking group sizes
dtable(melanoma,~thickcat.0)
thickcat.0
    [0,200]   (200,500]   (500,800] (800,2e+03]
        109          64          20          12
The new variable can also be added to the data-frame directly with the assignment form
dcut(melanoma,breaks=c(0,200,500,800,2000)) <- gr.thick1~thick
dlevels(melanoma)
thickcat.0 #levels=:4
[1] "[0,200]"     "(200,500]"   "(500,800]"   "(800,2e+03]"
-----------------------------------------
gr.thick1 #levels=:4
[1] "[0,200]"     "(200,500]"   "(500,800]"   "(800,2e+03]"
-----------------------------------------
Without a left-hand side the new variable gets a default name, thickcat.0 above (after the first cut-point). Leaving out breaks gives quartiles, with default name thickcat.4
dcut(melanoma) <- ~ thick   ### new variable is thickcat.4
dlevels(melanoma)
thickcat.0 #levels=:4
[1] "[0,200]"     "(200,500]"   "(500,800]"   "(800,2e+03]"
-----------------------------------------
gr.thick1 #levels=:4
[1] "[0,200]"     "(200,500]"   "(500,800]"   "(800,2e+03]"
-----------------------------------------
thickcat.4 #levels=:4
[1] "[10,97]"        "(97,194]"       "(194,356]"      "(356,1.74e+03]"
-----------------------------------------
or median groups, here starting again with the original data,
data(melanoma)
dcut(melanoma,breaks=2) <- ~ thick   ### new variable is thickcat.2
dlevels(melanoma)
thickcat.2 #levels=:2
[1] "[10,194]"       "(194,1.74e+03]"
-----------------------------------------
to control new names
data(melanoma)
mela=dcut(melanoma,thickcat4+dayscat4~thick+days,breaks=4)
dlevels(mela)
thickcat4 #levels=:4
[1] "[10,97]"        "(97,194]"       "(194,356]"      "(356,1.74e+03]"
-----------------------------------------
dayscat4 #levels=:4
[1] "[10,1.52e+03]"       "(1.52e+03,2e+03]"    "(2e+03,3.04e+03]"
[4] "(3.04e+03,5.56e+03]"
-----------------------------------------
or
data(melanoma)
dcut(melanoma,breaks=4) <- thickcat4+dayscat4~thick+days
dlevels(melanoma)
thickcat4 #levels=:4
[1] "[10,97]"        "(97,194]"       "(194,356]"      "(356,1.74e+03]"
-----------------------------------------
dayscat4 #levels=:4
[1] "[10,1.52e+03]"       "(1.52e+03,2e+03]"    "(2e+03,3.04e+03]"
[4] "(3.04e+03,5.56e+03]"
-----------------------------------------
The same groupings can also be typed out explicitly in base R
melanoma$gthick = cut(melanoma$thick,breaks=c(0,200,500,800,2000))
melanoma$gthick = cut(melanoma$thick,breaks=quantile(melanoma$thick),include.lowest=TRUE)
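The include.lowest=TRUE argument in the quantile version matters: cut uses intervals that are open on the left by default, so without it the smallest observation falls outside the first interval and becomes NA. A small sketch on six thickness values (the first six rows of melanoma, copied from the dhead output above):

```r
# Six thickness values, copied from the dhead() output
thick <- c(676, 65, 134, 290, 1208, 484)
q <- quantile(thick)  # 0%, 25%, 50%, 75%, 100%

# Default: intervals are (a, b], so the minimum is not covered
open_low <- cut(thick, breaks = q)

# include.lowest = TRUE closes the first interval: [min, b]
closed_low <- cut(thick, breaks = q, include.lowest = TRUE)

sum(is.na(open_low))    # one missing value: the minimum
sum(is.na(closed_low))  # no missing values
table(closed_low)
```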
Working with factors
To see levels of covariates in data-frame
data(melanoma)
dcut(melanoma,breaks=4) <- thickcat4~thick
dlevels(melanoma)
thickcat4 #levels=:4
[1] "[10,97]"        "(97,194]"       "(194,356]"      "(356,1.74e+03]"
-----------------------------------------
To relevel the factor
dtable(melanoma,~thickcat4)
melanoma = drelevel(melanoma,~thickcat4,ref="(194,356]")
dlevels(melanoma)
thickcat4
       [10,97]       (97,194]      (194,356] (356,1.74e+03]
            56             53             45             51

thickcat4 #levels=:4
[1] "[10,97]"        "(97,194]"       "(194,356]"      "(356,1.74e+03]"
-----------------------------------------
thickcat4.(194,356] #levels=:4
[1] "(194,356]"      "[10,97]"        "(97,194]"       "(356,1.74e+03]"
-----------------------------------------
or to pick the reference by its position in the list of levels, here the second level,
melanoma = drelevel(melanoma,~thickcat4,ref=2)
dlevels(melanoma)
thickcat4 #levels=:4
[1] "[10,97]"        "(97,194]"       "(194,356]"      "(356,1.74e+03]"
-----------------------------------------
thickcat4.(194,356] #levels=:4
[1] "(194,356]"      "[10,97]"        "(97,194]"       "(356,1.74e+03]"
-----------------------------------------
thickcat4.2 #levels=:4
[1] "(97,194]"       "[10,97]"        "(194,356]"      "(356,1.74e+03]"
-----------------------------------------
To combine levels of a factor (first combining the first 3 groups into one)
melanoma = drelevel(melanoma,~thickcat4,newlevels=1:3)
dlevels(melanoma)
thickcat4 #levels=:4
[1] "[10,97]"        "(97,194]"       "(194,356]"      "(356,1.74e+03]"
-----------------------------------------
thickcat4.(194,356] #levels=:4
[1] "(194,356]"      "[10,97]"        "(97,194]"       "(356,1.74e+03]"
-----------------------------------------
thickcat4.2 #levels=:4
[1] "(97,194]"       "[10,97]"        "(194,356]"      "(356,1.74e+03]"
-----------------------------------------
thickcat4.1:3 #levels=:2
[1] "[10,97]-(194,356]" "(356,1.74e+03]"
-----------------------------------------
or to combine groups 1 and 2 into one group and 3 and 4 into another
dkeep(melanoma) <- ~thick+thickcat4
melanoma = drelevel(melanoma,gthick2~thickcat4,newlevels=list(1:2,3:4))
dlevels(melanoma)
thickcat4 #levels=:4
[1] "[10,97]"        "(97,194]"       "(194,356]"      "(356,1.74e+03]"
-----------------------------------------
gthick2 #levels=:2
[1] "[10,97]-(97,194]"         "(194,356]-(356,1.74e+03]"
-----------------------------------------
Do the same, but control the names of the new groups
melanoma=drelevel(melanoma,gthick3~thickcat4,newlevels=list(group1.2=1:2,group3.4=3:4))
dlevels(melanoma)
thickcat4 #levels=:4
[1] "[10,97]"        "(97,194]"       "(194,356]"      "(356,1.74e+03]"
-----------------------------------------
gthick2 #levels=:2
[1] "[10,97]-(97,194]"         "(194,356]-(356,1.74e+03]"
-----------------------------------------
gthick3 #levels=:2
[1] "group1.2" "group3.4"
-----------------------------------------
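For readers used to base R, releveling and combining levels can be sketched with relevel and the list form of levels<-. The factor f below and its levels "a" to "d" are hypothetical stand-ins for the four thickness groups; this is an assumed analogue for comparison, not the drelevel implementation.

```r
# Hypothetical four-level factor standing in for thickcat4
f <- factor(c("a", "b", "c", "d", "b", "c"))

# Make "c" the reference (first) level; the other levels keep their order
f_ref <- relevel(f, ref = "c")
levels(f_ref)

# Combine levels and name the new groups in one step:
# each list element maps a new level name to the old levels it absorbs
g <- f
levels(g) <- list(group1.2 = c("a", "b"), group3.4 = c("c", "d"))
levels(g)
table(g)
```

Unlike drelevel, which adds a new variable to the data-frame, these base R calls return or modify a single factor.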
Making a factor from existing numeric variable and vice versa
A numeric variable such as status, with values 1, 2, 3, can be converted into a factor by
data(melanoma)
melanoma = dfactor(melanoma,~status, labels=c("malignant-melanoma","censoring","dead-other"))
melanoma = dfactor(melanoma,sexl~sex,labels=c("females","males"))
dtable(melanoma,~sexl+status.f)
        status.f malignant-melanoma censoring dead-other
sexl
females                          28        91          7
males                            29        43          7
A factor such as sexl, with levels "females" and "males", can be converted into a numeric variable by
melanoma = dnumeric(melanoma,~sexl)
dstr(melanoma,"sex*")
dtable(melanoma,~'sex*',level=2)
'data.frame':	205 obs. of  3 variables:
 $ sex   : int  1 1 1 0 1 1 1 1 0 0 ...
 $ sexl  : Factor w/ 2 levels "females","males": 2 2 2 1 2 2 2 2 1 1 ...
 $ sexl.n: num  2 2 2 1 2 2 2 2 1 1 ...

        sex   0   1
sexl
females     126   0
males         0  79

       sex   0   1
sexl.n
1          126   0
2            0  79

       sexl females males
sexl.n
1               126     0
2                 0    79
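Note that the numeric version holds the level codes 1 and 2, not the original 0 and 1. The base R sketch below reproduces this on the first ten values of sex from the dstr output above; the object names are made up for the illustration.

```r
# First ten values of sex, copied from the dstr() output
sex <- c(1, 1, 1, 0, 1, 1, 1, 1, 0, 0)

# Numeric to factor with explicit labels: 0 -> females, 1 -> males
sexl <- factor(sex, levels = c(0, 1), labels = c("females", "males"))

# Factor to numeric gives the level codes 1, 2 (not the original 0, 1)
sexl_n <- as.numeric(sexl)
sexl_n

# To get original numeric values back from an unlabelled factor,
# go through the character representation
sex_f <- factor(sex)
sex_back <- as.numeric(as.character(sex_f))
```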