recosystem is an R wrapper of the LIBMF library developed by Yu-Chin Juan, Yong Zhuang, Wei-Sheng Chin and Chih-Jen Lin (http://www.csie.ntu.edu.tw/~cjlin/libmf/), an open source library for recommender systems using matrix factorization (Lin et al. 2014).
The main task of a recommender system is to predict unknown entries in the rating matrix based on observed values, as shown in the table below:
| | item_1 | item_2 | item_3 | … | item_n |
|---|---|---|---|---|---|
| user_1 | 2 | 3 | ?? | … | 5 |
| user_2 | ?? | 4 | 3 | … | ?? |
| user_3 | 3 | 2 | ?? | … | 3 |
| … | … | … | … | … | … |
| user_m | 1 | ?? | 5 | … | 4 |
Each cell with a number in it is the rating given by some user on a specific item, while those marked with question marks are unknown ratings that need to be predicted. In the literature this problem also goes by other names, e.g. collaborative filtering, matrix completion, or matrix recovery.
A popular technique to solve the recommender system problem is the matrix factorization method. The idea is to approximate the whole rating matrix \(R_{m\times n}\) by the product of two matrices of lower dimensions, \(P_{k\times m}\) and \(Q_{k\times n}\), such that
\[R\approx P'Q\]
Let \(p_u\) be the \(u\)-th column of \(P\), and \(q_v\) be the \(v\)-th column of \(Q\), then the rating given by user \(u\) on item \(v\) would be predicted as \(p'_u q_v\).
A typical solution for \(P\) and \(Q\) is given by the following optimization problem (Chin et al. 2014):
\[\min_{P,Q} \sum_{(u,v)\in R} ((r_{u,v}-p'_u q_v)^2+\lambda_P ||p_u||^2+\lambda_Q ||q_v||^2)\]
where \((u,v)\) are locations of observed entries in \(R\), \(r_{u,v}\) is the observed rating, and \(\lambda_P,\lambda_Q\) are penalty parameters to avoid overfitting.
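To make the notation concrete, here is a tiny base-R sketch (with random matrices only, not how LIBMF actually computes \(P\) and \(Q\)) showing that each predicted rating is simply an inner product of two latent vectors:

```r
set.seed(42)
k <- 2; m <- 3; n <- 4             # latent dimension, number of users, items
P <- matrix(rnorm(k * m), k, m)    # column u is the latent vector p_u
Q <- matrix(rnorm(k * n), k, n)    # column v is the latent vector q_v
Rhat <- t(P) %*% Q                 # m x n matrix of all predicted ratings
# the (u, v) entry of Rhat is exactly the inner product p_u' q_v
u <- 2; v <- 3
c(Rhat[u, v], sum(P[, u] * Q[, v]))
```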
The LIBMF library on which recosystem is based generalizes the formula above a little further, resulting in the following more general but more complicated optimization problem (Chin et al. 2014):
\[\min_{P,Q,a,b} \sum_{(u,v)\in R} ((r_{u,v}-p'_u q_v-a_u-b_v-avg)^2+\lambda_P ||p_u||^2+\lambda_Q ||q_v||^2+\lambda_a||a||^2+\lambda_b||b||^2)\]
The added vectors \(a\) and \(b\) are called the user bias vector and item bias vector respectively, with \(\lambda_a\) and \(\lambda_b\) being their corresponding penalty parameters. \(avg\) is the average rating in the training data, which has the effect of centering the data first.
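Under this biased model, the objective above implies that a rating is predicted as \(avg + a_u + b_v + p'_u q_v\). A one-line numeric illustration with made-up values (none of these numbers come from LIBMF):

```r
avg <- 3.0                            # average rating in the training data
a_u <- 0.2; b_v <- -0.1               # user bias and item bias (illustrative)
p_u <- c(0.5, -0.3); q_v <- c(0.4, 0.1)
pred <- avg + a_u + b_v + sum(p_u * q_v)
pred                                  # 3.27
```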
LIBMF itself is a parallelized library, meaning that users can take advantage of multicore CPUs to speed up the computation. It also utilizes some advanced CPU features to further improve performance (Lin et al. 2014).
recosystem is a complete wrapper of LIBMF, hence all the features of LIBMF are included in recosystem. Also, unlike most other R packages for statistical modeling, which store the whole dataset in memory, LIBMF (and hence recosystem) is largely disk-based. The dataset is not loaded into memory all at once, but rather converted into a temporary binary file. Similarly, the constructed model, which contains the information needed for prediction, is stored on the hard disk. Finally, the prediction result is also not kept in memory but written into a file. As a result, recosystem has a comparatively small memory footprint.
The data files, both for training and testing, need to be arranged in sparse matrix triplet form, i.e., each line in the file contains three numbers

```
user_id item_id rating
```

Be careful with the convention that user_id and item_id start from 0, so the training data file for the example at the beginning will look like
```
0 0 2
0 1 3
1 1 4
1 2 3
2 0 3
2 1 2
...
```
And the testing data file is

```
0 2 0
1 0 0
2 2 0
...
```
Since ratings in the testing data are unknown, we put zeros in as placeholders. However, if their true values are given, the testing data will serve as a validation set on which the RMSE of prediction can be calculated.
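Since such triplet files are plain whitespace-separated text, they can be written from an R data frame with base functions. A sketch that recreates the toy training file above (the file is written to a temporary path here; the path and data frame name are arbitrary):

```r
train <- data.frame(user   = c(0, 0, 1, 1, 2, 2),
                    item   = c(0, 1, 1, 2, 0, 1),
                    rating = c(2, 3, 4, 3, 3, 2))
path <- tempfile(fileext = ".txt")
# no row names, no header: just "user_id item_id rating" per line
write.table(train, path, row.names = FALSE, col.names = FALSE, sep = " ")
readLines(path, n = 2)              # "0 0 2" "0 1 3"
```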
The usage of recosystem is quite simple, mainly consisting of four steps:

1. Create a model object by calling `Reco()`.
2. Call the `$convert_train()` and `$convert_test()` methods to convert data files in text mode into binary form.
3. Train the model with the `$train()` method. A number of parameters can be set inside the function.
4. Use the `$predict()` method to compute predictions and write the results into hard disk.

Below is an example on some simulated data:
```r
library(recosystem)
set.seed(123) # this is a randomized algorithm
trainset = system.file("dat", "smalltrain.txt", package = "recosystem")
testset = system.file("dat", "smalltest.txt", package = "recosystem")
r = Reco()
r$convert_train(trainset)
## Converting...done. 0.00
## binary file generated at /tmp/Rtmps9JWFS/smalltrain.txt.bin
r$convert_test(testset)
## Converting...done. 0.00
## binary file generated at /tmp/Rtmps9JWFS/smalltest.txt.bin
```
```r
r$train(opts = list(dim = 100, niter = 100,
                    cost.p = 0.001, cost.q = 0.001))
## Warning: SSE is disabled.
## Reading training data...done. 0.00
## Initializing model...done. 0.00
## iter time
## 1 0.00
## 2 0.00
## 3 0.00
## 4 0.00
## 5 0.00
## ...
## 99 0.10
## 100 0.10
## Writing model...done. 0.00
## model file generated at /tmp/Rtmps9JWFS/smalltrain.txt.bin.model
```
```r
print(r)
## [=== Training set ===]
##
## number of users = 1000
## number of items = 1000
## number of ratings = 10000
## average = 3.007000
##
## [=== Testing set ===]
##
## number of users = 1000
## number of items = 1000
## number of ratings = 10000
## average = 3.005600
##
## [=== Model ===]
##
## number of users = 1000
## number of items = 1000
## dimensions = 100
## lambda p = 0.001000
## lambda q = 0.001000
## lambda ub = -1.000000
## lambda ib = -1.000000
## gamma = 0.001000
## average = 0.000000
```
```r
outfile = tempfile()
r$predict(outfile)
## Predicting...done. 0.00
## RMSE: 0.992
## output file generated at /tmp/Rtmps9JWFS/file1e67e1013b5

# Compare the first few true values of testing data
# with predicted ones

# True values
print(read.table(testset, header = FALSE, sep = " ", nrows = 10)$V3)
## [1] 3 4 2 3 3 4 3 3 3 3
# Predicted values
print(scan(outfile, n = 10))
## [1] 3.283023 3.005262 3.046829 3.509161 2.003004 3.234289 2.708250
## [8] 2.779694 2.017818 3.440616
```
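The RMSE reported by the predict step is just the root mean squared difference between the true and predicted ratings, so it can also be reproduced by hand. A sketch with a hypothetical helper function (`rmse` is not part of recosystem), applied here only to the first three ratings shown above:

```r
# root mean squared error between two numeric vectors
rmse <- function(truth, pred) sqrt(mean((truth - pred)^2))
rmse(c(3, 4, 2), c(3.283023, 3.005262, 3.046829))
```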
Detailed help documents for each function are available in the topics ?recosystem::Reco, ?recosystem::convert, ?recosystem::train and ?recosystem::predict.
LIBMF utilizes some compiler and CPU features that may be unavailable on some systems. To build recosystem from source, one needs a C++ compiler that supports the C++11 standard.

Also, there are some flags in the file src/Makevars that may have a significant effect on performance. It is strongly suggested to set proper flags according to your type of CPU before compiling the package, in order to achieve the best performance:
- If your CPU doesn't support SSE3 (typically very old CPUs), set

  ```
  PKG_CPPFLAGS = -DNOSSE
  ```

  in the src/Makevars file.

- If SSE3 is supported, set

  ```
  PKG_CXXFLAGS = -msse3
  ```

- If not only SSE3 but also AVX is supported, set

  ```
  PKG_CXXFLAGS = -mavx
  PKG_CPPFLAGS = -DUSEAVX
  ```
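One quick, system-specific way to check which of these instruction sets your CPU supports (this assumes a Linux machine with /proc/cpuinfo; it is not part of recosystem) is:

```shell
# Linux only: list which relevant instruction sets the CPU advertises.
# Note: in /proc/cpuinfo, SSE3 is listed under the name "pni".
if [ -r /proc/cpuinfo ]; then
    grep -m 1 '^flags' /proc/cpuinfo | tr ' ' '\n' | grep -E '^(pni|avx)$' || true
else
    echo "/proc/cpuinfo not available (not Linux?)"
fi
```

On other systems, consult your OS's CPU information tools instead.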
After editing the Makevars file, run R CMD INSTALL recosystem on the package source directory to install recosystem.
Chin, Wei-Sheng, Yong Zhuang, Yu-Chin Juan, and Chih-Jen Lin. 2014. “A Fast Parallel Stochastic Gradient Method for Matrix Factorization in Shared Memory Systems.” Technical Report. http://www.csie.ntu.edu.tw/~cjlin/papers/libmf/libmf_journal.pdf.
Lin, Chih-Jen, Yu-Chin Juan, Yong Zhuang, and Wei-Sheng Chin. 2014. “LIBMF: A Matrix-Factorization Library for Recommender Systems.” http://www.csie.ntu.edu.tw/~cjlin/libmf/.