MixtComp can be used via JMixtComp or RMixtComp. JMixtComp requires three JSON files: algo, data and descriptor. RMixtComp requires three R objects: algo (a list), data (a list, a data.frame or a matrix) and descriptor (a list).
The data file is a csv file (; separator) in which each column corresponds to a variable. The first row contains the name of each variable. The other rows correspond to the different samples. See the section associated with each model for more details about the data format.
{
"varName1": ["elem11", "elem12", "elem13", "elem14"],
"varName2": ["elem21", "elem22", "elem23", "elem24"],
"varName3": ["elem31", "elem32", "elem33", "elem34"]
}
data <- list(
varName1 = c(elem11, elem12, elem13, elem14),
varName2 = c(elem21, elem22, elem23, elem24),
varName3 = c(elem31, elem32, elem33, elem34)
)
data <- data.frame(
varName1 = c(elem11, elem12, elem13, elem14),
varName2 = c(elem21, elem22, elem23, elem24),
varName3 = c(elem31, elem32, elem33, elem34)
)
data <- matrix(c(elem11, elem12, elem13, elem14,
elem21, elem22, elem23, elem24,
elem31, elem32, elem33, elem34), ncol = 3, dimnames = list(NULL, c("varName1", "varName2", "varName3")))
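As a minimal sketch of the csv layout described above (semicolon separator, one column per variable, variable names in the first row), the following base R code writes a small data frame to a temporary file and reads it back as character columns. The variable names and values are illustrative placeholders, not taken from MixtComp.

```r
# Illustrative data: every value is stored as a character string.
data <- data.frame(
  varName1 = c("1.2", "?", "0.7"),
  varName2 = c("3", "5", "[2:4]"),
  stringsAsFactors = FALSE
)

# Write with ';' as separator and the variable names in the first row.
csvFile <- tempfile(fileext = ".csv")
write.table(data, csvFile, sep = ";", row.names = FALSE, quote = FALSE)

# Read back, keeping everything as character so "?" and "[2:4]" survive.
dataRead <- read.table(csvFile, sep = ";", header = TRUE,
                       colClasses = "character", stringsAsFactors = FALSE)
```

Reading every column as character avoids any numeric conversion of the missing-data and interval notations.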
Descriptor is a list describing the variables used for clustering and the model applied to each of them. Each element corresponds to a variable and contains two elements: the model used (type) and the hyperparameters of the model, if any (paramStr). When there are no hyperparameters, the user must provide an empty string "". The descriptor object can contain fewer variables than the data object; only variables listed in the descriptor object are used for clustering.
{
"varName1": {
"type": "Model1",
"paramStr": "Param1"
},
"varName2": {
"type": "Model2",
"paramStr": "Param2"
},
"varName3": {
"type": "Model3",
"paramStr": ""
}
}
desc <- list(varName1 = list(type = "Model1", paramStr = "param1"),
varName2 = list(type = "Model2", paramStr = "param2"),
varName3 = list(type = "Model3", paramStr = ""))
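The rules above (each descriptor entry has a type and a paramStr, and variables missing from the descriptor are ignored) can be sketched with a small check. `checkDescriptor` is a hypothetical helper written for illustration, not an RMixtComp function; it assumes data is a named list or data.frame, and the variable names are made up.

```r
# Hypothetical helper (not part of RMixtComp): checks a descriptor against the data
# and returns the names of the data variables that will be ignored.
checkDescriptor <- function(data, desc) {
  missingVars <- setdiff(names(desc), names(data))
  if (length(missingVars) > 0) {
    stop("Variables not found in data: ", paste(missingVars, collapse = ", "))
  }
  for (varName in names(desc)) {
    if (!all(c("type", "paramStr") %in% names(desc[[varName]]))) {
      stop("Descriptor of ", varName, " needs both 'type' and 'paramStr'")
    }
  }
  # Variables present in data but absent from the descriptor are not used.
  setdiff(names(data), names(desc))
}

data <- list(age = c("31", "?", "45"),
             income = c("2.1", "3.4", "1.9"),
             id = c("a", "b", "c"))
desc <- list(age = list(type = "Poisson", paramStr = ""),
             income = list(type = "Gaussian", paramStr = ""))
ignored <- checkDescriptor(data, desc)  # "id" is not used for clustering
```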
Algo is a list containing the required parameters of the SEM algorithm.
The user can add extra elements; they are copied to the output object.
{
"nClass": 2,
"nInd": 15,
"nbBurnInIter": 100,
"nbIter": 100,
"nbGibbsBurnInIter": 100,
"nbGibbsIter": 100,
"nInitPerClass": 2,
"nSemTry": 10,
"confidenceLevel": 0.95,
"ratioStableCriterion": 0.9,
"nStableCriterion": 7,
"notes": "You can add any note you wish in non mandatory fields like this one (notes). They will be copied to the output."
}
algo <- list(nClass = 2,
nInd = 15,
nbBurnInIter = 100,
nbIter = 100,
nbGibbsBurnInIter = 100,
nbGibbsIter = 100,
nInitPerClass = 2,
nSemTry = 10,
confidenceLevel = 0.95,
ratioStableCriterion = 0.9,
nStableCriterion = 7,
notes = "You can add any note you wish in non mandatory fields like this one (notes). They will be copied to the output.")
In RMixtComp, nInd can be omitted and nClass is copied from the mixtCompLearn function's argument.
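One way to avoid retyping the whole list is to start from a set of defaults and override only what changes. The sketch below uses a hypothetical helper, `makeAlgo`, written in base R for illustration; it is not an RMixtComp function, and the default values are simply those of the example above. nInd and nClass are left out, as allowed in RMixtComp.

```r
# Hypothetical helper (not an RMixtComp function): example defaults plus user overrides.
makeAlgo <- function(...) {
  defaults <- list(nbBurnInIter = 100, nbIter = 100,
                   nbGibbsBurnInIter = 100, nbGibbsIter = 100,
                   nInitPerClass = 2, nSemTry = 10,
                   confidenceLevel = 0.95,
                   ratioStableCriterion = 0.9, nStableCriterion = 7)
  # Overrides replace defaults; unknown names (e.g. notes) are appended as extra elements.
  modifyList(defaults, list(...))
}

algo <- makeAlgo(nbIter = 200, notes = "run with a longer SEM phase")
```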
Available models | Data type | Restrictions | Hyperparameters |
---|---|---|---|
Gaussian | Real | | |
Weibull | Real | \(`\geq 0`\) | |
Poisson | Integer | \(`\geq 0`\) | |
NegativeBinomial | Integer | \(`\geq 0`\) | |
Multinomial | Categorical | | yes (but no need to provide it) |
Rank_ISR | Rank | | yes (but no need to provide it) |
Func_CS | Functional | | yes |
Func_SharedAlpha_CS | Functional | | yes |
Eight models are available in (R)MixtComp.
The Gaussian model is for real data. For a class \(`k`\), the parameters are the mean (\(`\mu_k`\)) and the standard deviation (\(`\sigma_k`\)). The density function is defined by:
\[ f_k(x) = \frac{1}{\sqrt{2\pi\sigma_k^2}}\exp{\left(-\frac{(x-\mu_k)^2}{2\sigma_k^2}\right)} \]
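As a quick numerical check, the Gaussian density with mean \(`\mu_k`\) and standard deviation \(`\sigma_k`\) can be written out by hand and compared with base R's `dnorm`. The function name `gaussDensity` and the parameter values are illustrative only.

```r
# Gaussian density written out explicitly (illustrative helper).
gaussDensity <- function(x, mu, sigma) {
  exp(-(x - mu)^2 / (2 * sigma^2)) / sqrt(2 * pi * sigma^2)
}

x <- c(-1, 0, 0.5, 2)
# TRUE when the hand-written density agrees with dnorm.
gaussMatches <- isTRUE(all.equal(gaussDensity(x, mu = 0.5, sigma = 1.5),
                                 dnorm(x, mean = 0.5, sd = 1.5)))
```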
The Weibull model is for positive real data (typically lifetimes). For a class \(`j`\), the parameters are the shape (\(`k_j`\)) and the scale (\(`\lambda_j`\)). The density function is defined by:
\[ f_j(x) = \frac{k_j}{\lambda_j} \left(\frac{x}{\lambda_j}\right)^{k_j-1} \exp{\left(-\left(\frac{x}{\lambda_j}\right)^{k_j}\right)} \]
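The shape/scale parameterization used here is the same as that of base R's `dweibull`, which the following sketch checks numerically. The helper name `weibullDensity` and the parameter values are arbitrary choices for illustration.

```r
# Weibull density with shape k_j and scale lambda_j (illustrative helper).
weibullDensity <- function(x, shape, scale) {
  (shape / scale) * (x / scale)^(shape - 1) * exp(-(x / scale)^shape)
}

x <- c(0.26, 1.21, 2.1)
# TRUE when the hand-written density agrees with dweibull.
weibullMatches <- isTRUE(all.equal(weibullDensity(x, shape = 1.5, scale = 2),
                                   dweibull(x, shape = 1.5, scale = 2)))
```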
The Poisson model is for non-negative integer (count) data. For a class \(`k`\), the single parameter \(`\lambda_k`\) is both the mean and the variance. The probability mass function is defined by:
\[ f_k(x) = \frac{\lambda_k^x}{x!}\exp{(-\lambda_k)} \]
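A short sketch, with an arbitrary value of \(`\lambda_k`\), compares the Poisson probability mass function written by hand with base R's `dpois`. The helper name `poissonPmf` is illustrative.

```r
# Poisson probability mass function for a class with parameter lambda.
poissonPmf <- function(x, lambda) {
  lambda^x / factorial(x) * exp(-lambda)
}

x <- 0:6
# TRUE when the hand-written pmf agrees with dpois.
poissonMatches <- isTRUE(all.equal(poissonPmf(x, lambda = 2.5), dpois(x, lambda = 2.5)))
```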
The NegativeBinomial model is for non-negative integer (count) data. For a class \(`k`\), the parameters are the number of successes (\(`n_k`\)) and the probability of success (\(`p_k`\)). The probability mass function is defined by:
\[ f_k(x) = \frac{\Gamma(x+n_k)}{x! \Gamma(n_k)} p_k^{n_k}(1-p_k)^x \]
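This parameterization corresponds to the `size` and `prob` arguments of base R's `dnbinom`, and \(`n_k`\) does not have to be an integer. The sketch below checks this numerically; the helper name `negBinPmf` and the parameter values are illustrative.

```r
# Negative binomial pmf with n successes and success probability p.
negBinPmf <- function(x, n, p) {
  gamma(x + n) / (factorial(x) * gamma(n)) * p^n * (1 - p)^x
}

x <- 0:6
# TRUE when the hand-written pmf agrees with dnbinom(size = n, prob = p).
negBinMatches <- isTRUE(all.equal(negBinPmf(x, n = 3.5, p = 0.4),
                                  dnbinom(x, size = 3.5, prob = 0.4)))
```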
The Multinomial model is for categorical data. For a class \(`k`\), the model has \(`M`\) parameters \(`p_{k,j},\, j=1,...,M`\), where \(`M`\) is the number of modalities and \(`p_{k,j}`\) is the probability of taking modality \(`j`\). The parameters \(`p_{k,j},\, j=1,...,M`\) must satisfy \(`\sum_{j=1}^M p_{k,j} = 1`\).
The probability mass function is defined by:
\[
f_k(x) = \prod_{j=1}^M p_{k,j}^{a_j} \quad \text{with} \quad a_j = \begin{cases}
1 &\text{if } x = j \\
0 &\text{otherwise}
\end{cases}
\]
The hyperparameter \(`M`\) does not need to be specified; it can be guessed from the data. If you want to specify it, add "nModality: M" in the appropriate field of the descriptor object.
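To make the indicator notation of the Multinomial model concrete, the following sketch evaluates the probability mass function for one class with \(`M = 3`\) modalities. The probabilities and the helper name `multinomPmf` are made up for illustration.

```r
# Class probabilities p_{k,j} for M = 3 modalities (illustrative values, sum to 1).
p <- c(0.2, 0.5, 0.3)

# Product of p_j^{a_j}, where a_j is 1 for the observed modality and 0 otherwise.
multinomPmf <- function(x, p) {
  a <- as.integer(seq_along(p) == x)
  prod(p^a)
}

probOfModality2 <- multinomPmf(2, p)                      # equals p[2]
totalProb <- sum(vapply(1:3, multinomPmf, numeric(1), p = p))
```

Since only one indicator is non-zero, the product reduces to the probability of the observed modality.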
The Rank_ISR model is for ranking data. For a class \(`k`\), the two parameters are the central rank (\(`\mu_k`\)) and the probability of making a wrong comparison (\(`\pi_k`\)). See the article for more details. Ranks have their size \(`M`\) as hyperparameter, but it does not need to be specified; it can be guessed from the data. If you want to specify it, add "nModality: M" in the appropriate field of the descriptor object.
Gaussian data are real values written with a dot as the decimal separator. Missing data are indicated by a \(`?`\). Partial data can be provided through intervals denoted by \(`[a:b]`\) where \(`a`\) (resp. \(`b`\)) is a real or \(`-inf`\) (resp. \(`+inf`\)).
{
"varGauss1": ["2.1", "-0.26", "?", "[0.56:1.28]", "1.21", "[-inf:-0.11]", "[-1.65:+inf]"]
}
data <- list(
varGauss1 = c("2.1", "-0.26", "?", "[0.56:1.28]", "1.21", "[-inf:-0.11]", "[-1.65:+inf]")
)
data <- data.frame(
varGauss1 = c("2.1", "-0.26", "?", "[0.56:1.28]", "1.21", "[-inf:-0.11]", "[-1.65:+inf]")
)
data <- matrix(c("2.1", "-0.26", "?", "[0.56:1.28]", "1.21", "[-inf:-0.11]", "[-1.65:+inf]"), ncol = 1, dimnames = list(NULL, c("varGauss1")))
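To illustrate how these entries can be read, the sketch below turns one entry into a pair of lower and upper bounds: an observed value gives equal bounds, an interval gives its two ends, and \(`?`\) gives an unbounded range. `parseRealEntry` and `toNumber` are hypothetical helpers written for this example; they are not part of MixtComp and do no input validation.

```r
# Hypothetical helper: converts "-inf" / "+inf" / a decimal string to a number.
toNumber <- function(s) {
  if (s == "-inf") return(-Inf)
  if (s == "+inf") return(Inf)
  as.numeric(s)
}

# Hypothetical helper: returns the range of values compatible with one entry.
parseRealEntry <- function(entry) {
  if (entry == "?") return(c(lower = -Inf, upper = Inf))
  if (grepl("^\\[.*:.*\\]$", entry)) {
    # Strip the brackets, then split on the ':' separating the two bounds.
    bounds <- strsplit(substr(entry, 2, nchar(entry) - 1), ":", fixed = TRUE)[[1]]
    return(c(lower = toNumber(bounds[1]), upper = toNumber(bounds[2])))
  }
  value <- toNumber(entry)
  c(lower = value, upper = value)
}

entries <- c("2.1", "?", "[0.56:1.28]", "[-inf:-0.11]", "[-1.65:+inf]")
bounds <- t(vapply(entries, parseRealEntry, numeric(2)))
```

Each row of `bounds` holds the lower and upper bound of the corresponding entry.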
Weibull data are real positive values with the dot as decimal separator. Missing data are indicated by a \(`?`\). Partial data can be provided through intervals denoted by \(`[a:b]`\) where \(`a`\) and \(`b`\) are positive reals (\(`b`\) can be +inf).
{
"varWeib1": ["2.1", "0.26", "?", "[0.56:1.28]", "1.21", "[0:5.11]", "[1.65:+inf]"]
}
data <- list(
varWeib1 = c("2.1", "0.26", "?", "[0.56:1.28]", "1.21", "[0:5.11]", "[1.65:+inf]")
)
data <- data.frame(
varWeib1 = c("2.1", "0.26", "?", "[0.56:1.28]", "1.21", "[0:5.11]", "[1.65:+inf]")
)
data <- matrix(c("2.1", "0.26", "?", "[0.56:1.28]", "1.21", "[0:5.11]", "[1.65:+inf]"), ncol = 1, dimnames = list(NULL, c("varWeib1")))
Count data are non-negative integers. Missing data are indicated by a \(`?`\). Partial data can be provided through intervals denoted by \(`[a:b]`\) where \(`a`\) and \(`b`\) are non-negative integers (\(`b`\) can be +inf).
{
"varCount1": ["2", "3", "?", "4", "[2:4]", "[4:+inf]", "1"]
}
data <- list(
varCount1 = c("2", "3", "?", "4", "[2:4]", "[4:+inf]", "1")
)
data <- data.frame(
varCount1 = c("2", "3", "?", "4", "[2:4]", "[4:+inf]", "1")
)
data <- matrix(c("2", "3", "?", "4", "[2:4]", "[4:+inf]", "1"), ncol = 1, dimnames = list(NULL, c("varCount1")))
Modalities must be consecutive integers with 1 as minimal value. Missing data are indicated by a \(`?`\). For partial data, a list of possible values can be provided by \(`\{a_1,...,a_j\}`\), where \(`a_i`\) denotes a modality.
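A minimal sketch of this encoding, assuming illustrative marital-status labels and a hypothetical helper `encodeCateg` (neither comes from MixtComp): labels are mapped to consecutive integers starting at 1, an unknown value becomes `?`, and a set of candidate labels becomes a braced list.

```r
# Illustrative mapping from labels to consecutive integer modalities.
codes <- c(married = 1, single = 2, divorced = 3)

# Hypothetical helper: NA -> "?", one label -> its code, several labels -> "{a,b}".
encodeCateg <- function(labels) {
  if (length(labels) == 1 && is.na(labels)) return("?")
  values <- sort(codes[labels])
  if (length(values) == 1) {
    as.character(values)
  } else {
    paste0("{", paste(values, collapse = ","), "}")
  }
}

raw <- list("married", "single", NA, "divorced", c("divorced", "single"))
encoded <- vapply(raw, encodeCateg, character(1))
```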
Categorical data before formatting:
varCateg1 | varCateg2 |
---|---|
married | large |
single | small |
status unknown | medium |
divorced | small or medium |
divorced or single | large |
after formatting:
{
"varCat1": ["1", "2", "?", "3", "{2,3}"],
"varCat2": ["3", "1", "2", "{1,2}", "3"]
}
data <- list(
varCat1 = c("1", "2", "?", "3", "{2,3}"),
varCat2 = c("3", "1", "2", "{1,2}", "3")
)
data <- data.frame(
varCat1 = c("1", "2", "?", "3", "{2,3}"),
varCat2 = c("3", "1", "2", "{1,2}", "3")
)
data <- matrix(c("1", "2", "?", "3", "{2,3}",
"3", "1", "2", "{1,2}", "3"), ncol = 2, dimnames = list(NULL, c("varCat1", "varCat2")))
The format of a rank is: \(`o_1,..., o_j`\) where \(`o_1`\) is an integer corresponding to the number of the object ranked in 1st position. For example: \(`4,2,1,3`\) means that the fourth object is ranked first, then the second object is in second position, and so on. Missing data can be specified by replacing an object by a \(`?`\) or a list of potential objects. For example: \(`4, \{2~3\}, \{2~1\}, ?`\) means that the object ranked in second position is either the object number 2 or the object number 3, then the object ranked in third position is either the object 2 or 1, and the last one can be anything. A totally missing rank is specified by \(`?,?,...,?`\).
{
"varRank1": ["1,2,3,4", "2,1,3,4", "?,?,?,?", "4,{2,3},{1,3},{1,2}", "2,{1,3},4,{1,3}"]
}
data <- list(
varRank1 = c("1,2,3,4", "2,1,3,4", "?,?,?,?", "4,{2,3},{1,3},{1,2}", "2,{1,3},4,{1,3}")
)
data <- data.frame(
varRank1 = c("1,2,3,4", "2,1,3,4", "?,?,?,?", "4,{2,3},{1,3},{1,2}", "2,{1,3},4,{1,3}")
)
data <- matrix(c("1,2,3,4", "2,1,3,4", "?,?,?,?", "4,{2,3},{1,3},{1,2}", "2,{1,3},4,{1,3}"), ncol = 1, dimnames = list(NULL, c("varRank1")))
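Before passing ranks to the model, it can help to tell fully observed ranks from partially missing ones. The sketch below uses a hypothetical helper, `isCompleteRank` (not a MixtComp function), that returns TRUE only when an entry is a permutation of the objects 1 to M.

```r
# Hypothetical helper: TRUE when the entry is a fully observed rank,
# i.e. a comma-separated permutation of 1..M with no "?" or "{...}" parts.
isCompleteRank <- function(rank) {
  parts <- strsplit(rank, ",", fixed = TRUE)[[1]]
  # Non-numeric parts ("?", "{2", "3}") become NA; the warning is expected.
  objects <- suppressWarnings(as.integer(parts))
  !anyNA(objects) && identical(sort(objects), seq_along(objects))
}

ranks <- c("1,2,3,4", "2,1,3,4", "?,?,?,?", "4,{2,3},{1,3},{1,2}", "4,2,2,3")
complete <- vapply(ranks, isCompleteRank, logical(1), USE.NAMES = FALSE)
```

The last entry is rejected because object 2 appears twice, so it is not a valid ranking.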
Missing data | Multinomial | Gaussian | Poisson | NegativeBinomial | Weibull | Rank_ISR | Func_CS | LatentClass |
---|---|---|---|---|---|---|---|---|
Completely missing | \(`?`\) | \(`?`\) | \(`?`\) | \(`?`\) | \(`?`\) | \(`?,?,?,?`\) | | \(`?`\) |
Finite number of values | \(`\{a,b,c\}`\) | | | | | \(`4,\{1~2\},3,\{1~2\}`\) | | \(`\{a,b,c\}`\) |
Bounded interval | | \(`[a:b]`\) | \(`[a:b]`\) | \(`[a:b]`\) | \(`[a:b]`\) | | | |
Right bounded interval | | \(`[-inf:b]`\) | | | | | | |
Left bounded interval | | \(`[a:+inf]`\) | \(`[a:+inf]`\) | \(`[a:+inf]`\) | \(`[a:+inf]`\) | | | |
To perform a (semi-)supervised clustering, the user can add a variable named z_class (possibly with some missing values) with "LatentClass" as model. Missing data are indicated by a \(`?`\). For partial data, a list of possible values can be provided by \(`\{a_1,...,a_j\}`\), where \(`a_i`\) denotes a class number.
{
"varGauss1": ["2.1", "-0.26", "?", "[0.56:1.28]", "1.21", "[-inf:-0.11]", "[-1.65:+inf]"],
"z_class" : ["1", "1", "{1,3}", "3", "?", "2", "1"]
}
{
"varGauss1": {
"type": "Gaussian",
"paramStr": ""
},
"z_class": {
"type": "LatentClass",
"paramStr": ""
}
}
data <- list(
varGauss1 = c("2.1", "-0.26", "?", "[0.56:1.28]", "1.21", "[-inf:-0.11]", "[-1.65:+inf]"),
z_class = c("1", "1", "{1,3}", "3", "?", "2", "1")
)
data <- data.frame(
varGauss1 = c("2.1", "-0.26", "?", "[0.56:1.28]", "1.21", "[-inf:-0.11]", "[-1.65:+inf]"),
z_class = c("1", "1", "{1,3}", "3", "?", "2", "1")
)
data <- matrix(c("2.1", "-0.26", "?", "[0.56:1.28]", "1.21", "[-inf:-0.11]", "[-1.65:+inf]",
"1", "1", "{1,3}", "3", "?", "2", "1"), ncol = 2, dimnames = list(NULL, c("varGauss1", "z_class")))
desc <- list(varGauss1 = list(type = "Gaussian", paramStr = ""),
z_class = list(type = "LatentClass", paramStr = ""))
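When class labels are only known for some individuals, the z_class column can be built from a vector with NA for the unknown labels. This is a small sketch with made-up labels; `knownLabels` is an illustrative name, and partial labels such as `{1,3}` would still have to be written by hand.

```r
# Illustrative labels: NA marks individuals whose class is unknown.
knownLabels <- c(1, 1, NA, 3, NA, 2, 1)

# Unknown labels become "?", known labels are written as strings.
z_class <- ifelse(is.na(knownLabels), "?", as.character(knownLabels))

data <- list(
  varGauss1 = c("2.1", "-0.26", "?", "[0.56:1.28]", "1.21", "[-inf:-0.11]", "[-1.65:+inf]"),
  z_class = z_class
)
```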