The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
This R package aims to fit Repeated Linear Regressions in which there are some same terms.
Let’s start with the simplest situation, we want to fit a set of regressions which only differ in one variable. Specifically, denote the response variable as \(y\), and these regressions are as follows.
\[ \begin{array}{ll} y&\sim x_1 + cov_1 + cov_2+\ldots+cov_m\\ y&\sim x_2 + cov_1 +cov_2+\ldots+cov_m\\ \cdot &\sim \cdots\\ y&\sim x_n + cov_1 +cov_2+\ldots+cov_m\\ \end{array} \]
where \(cov_i, i=1,\ldots, m\) are the same variables among these regressions.
Intuitively, we can finish this task by using a simple loop.
However, it is not efficient in that situation. As we all know, in the linear regression, the main goal is to estimate the parameter \(\beta\). And we have
\[ \hat\beta = (X'X)^{-1}X'Y \]
where \(X\) is the design matrix and \(Y\) is the observation of response variable.
It is obvious that there are some same elements in the design matrix, and the larger \(m\) is, the more elements are the same. So I want to reduce the cost of computation by separating the same part in the design matrix.
For the above example, the design matrix can be denoted as \(X=(x, cov)\). If we consider intercept, it also can be seen as the same variable among these regression, so it can be included in \(cov\) naturally. Then we have
\[ (X'X)^{-1}= \left[ \begin{array}{cc} x'x & x'cov \\ cov'x & cov'cov \end{array} \right]= \left[ \begin{array}{ll} a& v'\\ v& B \end{array} \right] \]
Woodbury formula tells us
\[ (A+UCV)^{-1}=A^{-1}-A^{-1}U(C^{-1}+VA^{-1}U)^{-1}VA^{-1} \]
Let
\[ A=\left[ \begin{array}{ll} a&O\\ O&B \end{array}\right],\; U=\left[ \begin{array}{ll} 1 & 0\\ O & v \end{array} \right],\; V= \left[ \begin{array}{ll} 0& v'\\ 1& O \end{array} \right] \]
and \(C=I_{2\times 2}\). Then we can apply woodbury formula,
\[ (X'X)^{-1}=(A+UCV)^{-1}=A^{-1}-A^{-1}U(C^{-1}+VA^{-1}U)^{-1}VA^{-1} \]
where
\[ A^{-1}=\left[ \begin{array}{cc} a^{-1}&O\\ O&B^{-1} \end{array} \right] \]
We can do further calculations to simplify and obtain the following result
\[ (X'X)^{-1}=\left[ \begin{array}{cc} 1/a+\frac{a}{a-v'B^{-1}v}v'B^{-1}v & -\frac{v'B^{-1}}{a-v'B^{-1}v}\\ -\frac{B^{-1}v}{a-v'B^{-1}v} & B^{-1}+\frac{-B^{-1}vv'B^{-1}}{a-v'B^{-1}v} \end{array} \right] \]
Notice that matrix \(B\) is the same for all regression, the identical terms for each regression are just \(a\) and \(v\), which are very easy to calculate. So theoretically, we can reduce the cost of computation significantly.
Now test two simulation examples by using the functions in this package.
## use fRLR package
set.seed(123)
X = matrix(rnorm(50), 10, 5)
Y = rnorm(10)
COV = matrix(rnorm(40), 10, 4)
frlr1(X, Y, COV, num_threads = 1)
#> r r.p.value
#> 1 0 0.4380128
#> 2 1 0.7791076
#> 3 2 0.2212869
#> 4 3 0.9495018
#> 5 4 0.6729983
## use simple loop
res = matrix(nrow = 0, ncol = 2)
for (i in 1:ncol(X))
{
mat = cbind(X[,i], COV)
df = as.data.frame(mat)
model = lm(Y~., data = df)
tmp = c(i, summary(model)$coefficients[2, 4])
res = rbind(res, tmp)
}
res
#> [,1] [,2]
#> tmp 1 0.4380128
#> tmp 2 0.7791076
#> tmp 3 0.2212869
#> tmp 4 0.9495018
#> tmp 5 0.6729983
As we can see in the above output, these p-values for the identical variable in each regression are equal between two methods.
Similarly, we can test another example
set.seed(123)
X = matrix(rnorm(50), 10, 5)
Y = rnorm(10)
COV = matrix(rnorm(40), 10, 4)
idx1 = c(1, 2, 3, 4, 1, 1, 1, 2, 2, 3)
idx2 = c(2, 3, 4, 5, 3, 4, 5, 4, 5, 5)
frlr2(X, idx1, idx2, Y, COV, num_threads = 1)
#> r1 r2 r1.p.value r2.p.value
#> 1 1 2 0.53021406 0.895719578
#> 2 2 3 0.01812006 0.009833047
#> 3 3 4 0.29895922 0.963995969
#> 4 4 5 0.91749181 0.712075464
#> 5 1 3 0.33761507 0.210331456
#> 6 1 4 0.51074586 0.966484642
#> 7 1 5 0.12479380 0.152802911
#> 8 2 4 0.79302893 0.902402294
#> 9 2 5 0.73153760 0.663392258
#> 10 3 5 0.32367303 0.877154122
res = matrix(nrow=0, ncol=4)
for (i in 1:length(idx1))
{
mat = cbind(X[, idx1[i]], X[,idx2[i]], COV)
df = as.data.frame(mat)
model = lm(Y~., data = df)
tmp = c(idx1[i], idx2[i], summary(model)$coefficients[2,4], summary(model)$coefficients[3,4])
res = rbind(res, tmp)
}
Again, we obtain the same results by different methods.
The main aim of this new method is to reduce the computation cost. Now let’s compare its speed with the simple-loop method.
We can obtain the following time cost for \(99\times 100/2=4950\) linear regressions.
set.seed(123)
n = 100
X = matrix(rnorm(10*n), 10, n)
Y = rnorm(10)
COV = matrix(rnorm(40), 10, 4)
#idx1 = c(1, 2, 3, 4, 1, 1, 1, 2, 2, 3)
#idx2 = c(2, 3, 4, 5, 3, 4, 5, 4, 5, 5)
id = combn(n, 2)
idx1 = id[1, ]
idx2 = id[2, ]
system.time(frlr2(X, idx1, idx2, Y, COV, num_threads = 1))
#> user system elapsed
#> 0.029 0.000 0.029
simpleLoop <- function()
{
res = matrix(nrow=0, ncol=4)
for (i in 1:length(idx1))
{
mat = cbind(X[, idx1[i]], X[,idx2[i]], COV)
df = as.data.frame(mat)
model = lm(Y~., data = df)
tmp = c(idx1[i], idx2[i], summary(model)$coefficients[2,4], summary(model)$coefficients[3,4])
res = rbind(res, tmp)
}
}
system.time(simpleLoop())
#> user system elapsed
#> 10.244 0.007 10.261
We can even speed up by passing num_threads = -1
(use
all possible threads).
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.