semanticfa performs exploratory factor analysis on language model embeddings of psychological scale items. Given item text, it embeds each item, computes a similarity matrix, and extracts latent factors — entirely from the text, with no human response data required.
The package is designed to feel familiar to psych and EFAtools users.
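The embed, similarity, extract pipeline described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: the embeddings are simulated rather than produced by a language model, and it swaps in principal-component loadings with a varimax rotation where the package output reports minres with oblimin.

```r
# Toy stand-in for item embeddings: 12 "items" in 16 dimensions, built from
# 3 hypothetical latent directions plus noise (not real language-model output).
set.seed(1)
n_items <- 12; dim_emb <- 16; n_latent <- 3
centers <- matrix(rnorm(n_latent * dim_emb), n_latent, dim_emb)
membership <- rep(seq_len(n_latent), each = n_items / n_latent)
emb <- centers[membership, ] +
  matrix(rnorm(n_items * dim_emb, sd = 0.5), n_items, dim_emb)

# Step 1: cosine similarity between items (rows scaled to unit length).
unit <- emb / sqrt(rowSums(emb^2))
sim <- tcrossprod(unit)

# Step 2: treat the similarity matrix like a correlation matrix and take
# principal-component loadings (eigenvectors scaled by sqrt(eigenvalue)).
eig <- eigen(sim, symmetric = TRUE)
loadings <- eig$vectors[, 1:n_latent] %*% diag(sqrt(eig$values[1:n_latent]))

# Step 3: rotate for interpretability (varimax, an orthogonal rotation
# available in base R's stats package).
rotated <- varimax(loadings)$loadings
round(unclass(rotated), 2)
```

Each row of the rotated matrix corresponds to one item and each column to one extracted factor, which is the same shape as the loadings table printed by the package.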
The package ships with the 50-item IPIP Big Five inventory and precomputed sentence-BERT embeddings, so you can try it with zero setup:
library(semanticfa)
data(big5)
fit <- sfa(
  big5$items,
  nfactors = 5,
  embeddings = big5$embeddings,
  scoring = big5$scoring
)
print(fit)
#> Semantic Factor Analysis
#> Encoding: atomic
#> Embedding dim: 384
#> Factors:5 (minres + oblimin)
#>
#> Diagnostics:
#> KMO: 0.866 (meritorious - higher is better)
#> TEFI: -47.0724 (lower is better)
#> RMSR: 0.0556 (acceptable - lower is better)
#> CAF: 0.4880 (marginal - higher is better)
#>
#> Factor loadings:
#>
#> Loadings:
#> MR1 MR4 MR3 MR5 MR2
#> item_39 0.617
#> item_35 0.598
#> item_34 0.561
#> item_05 0.531
#> item_32 0.524
#> item_49 0.509
#> item_04 0.492
#> item_28 0.485
#> item_38 0.444
#> item_12 0.438 0.307
#> item_07 0.425
#> item_31 0.379
#> item_02 0.372 0.338
#> item_01 0.366
#> item_33 0.365
#> item_36 0.343
#> item_13
#> item_37
#> item_16 0.601
#> item_48 0.548 0.307
#> item_24 0.530 0.315
#> item_21 0.493 0.417
#> item_23 0.467
#> item_30 0.464
#> item_47 0.444 0.347
#> item_29 0.413 0.351
#> item_15 0.411
#> item_19 0.383 0.377
#> item_06 0.319
#> item_27 0.888
#> item_25 0.708
#> item_08 0.498
#> item_22 0.480
#> item_09 0.418
#> item_10 0.361 0.400
#> item_03 0.366 0.383
#> item_20 0.786
#> item_14 0.686
#> item_18 0.662
#> item_11 0.498
#> item_17 0.485
#> item_26
#> item_44 0.362 0.686
#> item_42 0.672
#> item_50 0.649
#> item_45 0.603
#> item_43 0.528
#> item_46 0.512
#> item_41 0.327 0.390
#> item_40
#>
#> MR1 MR4 MR3 MR5 MR2
#> SS loadings 4.762 3.180 3.289 2.955 3.149
#> Proportion Var 0.095 0.064 0.066 0.059 0.063
#> Cumulative Var 0.095 0.159 0.225 0.284 0.347
#>
#> Factor correlations (Phi):
#> MR1 MR4 MR3 MR5 MR2
#> MR1 1.000 0.381 0.227 0.337 0.276
#> MR4 0.381 1.000 0.336 0.358 0.129
#> MR3 0.227 0.336 1.000 0.202 0.183
#> MR5 0.337 0.358 0.202 1.000 0.052
#> MR2 0.276 0.129 0.183 0.052 1.000
#>
#> Variance accounted for:
#> MR1 MR4 MR3 MR5 MR2
#> SS loadings 5.968 4.208 3.880 3.572 3.524
#> Proportion Var 0.119 0.084 0.078 0.071 0.070
#> Cumulative Var 0.119 0.204 0.281 0.353 0.423
#> Proportion Explained 0.282 0.199 0.183 0.169 0.167
#> Cumulative Proportion 0.282 0.481 0.665 0.833 1.000
When you omit nfactors, sfa() uses embedding-adapted parallel analysis (random unit vectors in the embedding dimension as the null):
fit_auto <- sfa(
  big5$items,
  embeddings = big5$embeddings,
  scoring = big5$scoring
)
cat("Auto-detected factors:", fit_auto$factors, "\n")
#> Auto-detected factors: 8
For a multi-method comparison, use sfa_nfactors():
sim <- sfa_similarity(big5$embeddings, encoding = "atomic_reversed",
                      scoring = big5$scoring)
nf <- sfa_nfactors(sim, big5$embeddings,
                   methods = c("parallel", "kaiser"),
                   parallel_iter = 50)
print(nf)
#> Factor retention analysis (embedding-adapted)
#>
#> Method n_factors
#> parallel 8
#> kaiser 13
#> ------------------------
#> Consensus 8
#>
#> Eigenvalues: 13.5 3.5 2.9 2.1 1.8 1.7 1.6 1.5 1.3 1.2 ...
The encoding argument controls how embeddings become a similarity matrix:
sim_ar <- sfa_similarity(big5$embeddings, "atomic_reversed", big5$scoring)
sim_sq <- sfa_similarity(big5$embeddings, "squid", big5$scoring)
#> Warning: 'scoring' has reverse-keyed items but is ignored for encoding =
#> "squid": this method is keying-free by design. Use "atomic_reversed" for keyed
#> sign-flipping.
sim_mcp <- sfa_similarity(big5$embeddings, "mean_centered_pearson", big5$scoring)
#> Warning: 'scoring' has reverse-keyed items but is ignored for encoding =
#> "mean_centered_pearson": this method is keying-free by design. Use
#> "atomic_reversed" for keyed sign-flipping.
cat("atomic_reversed range:", range(sim_ar[lower.tri(sim_ar)]), "\n")
#> atomic_reversed range: -0.8933168 0.7684432
cat("squid range: ", range(sim_sq[lower.tri(sim_sq)]), "\n")
#> squid range: -0.3360389 0.8676635
cat("mean_centered_pearson:", range(sim_mcp[lower.tri(sim_mcp)]), "\n")
#> mean_centered_pearson: -0.02542094 0.8932819
SQuID and mean-centered Pearson recover negative correlations between reverse-keyed dimensions; atomic_reversed does not.
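The difference between keyed sign-flipping and a keying-free encoding can be illustrated in base R. The sketch below is an assumption about what the encoding names imply, not the package's definitions: it contrasts cosine similarity with signs flipped by a hypothetical +1/-1 key vector against a plain Pearson correlation between item vectors. The toy embeddings and the `keys` vector are made up for illustration, and SQuID is not covered.

```r
set.seed(2)
emb <- matrix(rnorm(6 * 8), nrow = 6)   # toy embeddings: 6 items x 8 dims
keys <- c(1, 1, -1, 1, -1, 1)           # hypothetical keying: -1 = reverse-keyed

# Plain cosine similarity between item vectors.
unit <- emb / sqrt(rowSums(emb^2))
cos_sim <- tcrossprod(unit)

# Keyed sign-flipping: a pair's similarity changes sign when exactly one of
# the two items is reverse-keyed; the diagonal stays at 1.
sim_flipped <- cos_sim * outer(keys, keys)

# Keying-free alternative: Pearson correlation between item vectors, which
# centers each vector on its own mean before comparing directions.
sim_pearson <- cor(t(emb))

range(sim_flipped[lower.tri(sim_flipped)])
range(sim_pearson[lower.tri(sim_pearson)])
```

With sign-flipping, any negative similarity between a forward-keyed and a reverse-keyed item is imposed by the key; with a keying-free encoding, it has to come from the embeddings themselves.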
plot(fit, type = "scree")
Scree plot with parallel analysis threshold
plot(fit, type = "loadings")
Factor loading heatmap
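The parallel analysis threshold on the scree plot rests on the null described earlier: random unit vectors in the embedding dimension. A rough base-R sketch of that idea follows. The toy embeddings, the 95th-percentile cutoff, and the helper name `cosine_eigen` are assumptions made for illustration; the package's own procedure may differ in its details.

```r
# Observed side: toy embeddings with two planted clusters (a stand-in for
# real item embeddings).
set.seed(3)
n_items <- 20; dim_emb <- 32
centers <- matrix(rnorm(2 * dim_emb), 2, dim_emb)
emb <- centers[rep(1:2, each = 10), ] +
  matrix(rnorm(n_items * dim_emb, sd = 0.7), n_items, dim_emb)

# Eigenvalues of the cosine-similarity matrix of the rows of x.
cosine_eigen <- function(x) {
  unit <- x / sqrt(rowSums(x^2))
  eigen(tcrossprod(unit), symmetric = TRUE, only.values = TRUE)$values
}
observed <- cosine_eigen(emb)

# Null side: normalized Gaussian rows are uniformly distributed on the unit
# sphere, so they serve as random unit vectors of the same shape.
n_iter <- 50
null_eigs <- replicate(
  n_iter,
  cosine_eigen(matrix(rnorm(n_items * dim_emb), n_items, dim_emb))
)
threshold <- apply(null_eigs, 1, quantile, probs = 0.95)

# Retain the leading factors whose eigenvalue exceeds the null threshold.
exceeds <- observed > threshold
n_retain <- if (all(exceeds)) length(exceeds) else which(!exceeds)[1] - 1
n_retain
```

Plotting `observed` against `threshold` gives a hand-made version of a scree plot with a parallel analysis line.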
The $loadings component works directly with psych functions:
# Run human-data EFA (not run — requires response data)
human_fit <- psych::fa(response_data, nfactors = 5, rotate = "oblimin")
# Compare
psych::factor.congruence(fit$loadings, human_fit$loadings)
For NMI, ARI, Frobenius, and disattenuated correlation:
cong <- sfa_congruence(fit, big5$factors, metrics = c("nmi", "ari"))
print(cong)
#> Factor structure congruence
#>
#> NMI: 0.428 (weak - higher is better)
#> ARI: 0.257 (poor - higher is better)
Pass any embedding model’s output via embeddings=:
# With sentence-transformers (requires reticulate + Python).
# The default model is "Qwen/Qwen3-Embedding-0.6B"; larger models such as
# "Qwen/Qwen3-Embedding-4B" (8 GB RAM) or "Qwen/Qwen3-Embedding-8B" (16 GB RAM)
# recover factor structure more accurately.
emb <- sfa_embed(my_items, embed = "sbert", model = "Qwen/Qwen3-Embedding-0.6B")
fit <- sfa(my_items, embeddings = emb, scoring = my_scoring)
# Or bring your own function
my_embedder <- function(texts) {
  # ... your embedding logic ...
  # must return a numeric matrix (n_items x dim)
}
fit <- sfa(my_items, embed = my_embedder, scoring = my_scoring)
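As a concrete instance of the "bring your own function" contract (a function of texts that returns a numeric matrix with one row per item), here is a hypothetical toy embedder written in base R. It hashes words into buckets and counts them, so it carries no semantic information and is useful only for checking shapes and wiring. The name `toy_embedder`, its `dim` argument, and the example item texts are illustrative assumptions.

```r
# Hypothetical toy embedder: hashes each word into one of `dim` buckets and
# counts occurrences. Suitable for shape checks only, not for real analysis.
toy_embedder <- function(texts, dim = 64) {
  out <- matrix(0, nrow = length(texts), ncol = dim)
  for (i in seq_along(texts)) {
    words <- strsplit(tolower(texts[i]), "[^a-z]+")[[1]]
    words <- words[nzchar(words)]
    for (w in words) {
      # Bucket index in 1..dim derived from the word's character codes.
      bucket <- sum(utf8ToInt(w)) %% dim + 1
      out[i, bucket] <- out[i, bucket] + 1
    }
  }
  out
}

items <- c(
  "I am the life of the party.",
  "I get stressed out easily.",
  "I have a vivid imagination."
)
emb <- toy_embedder(items)
dim(emb)  # 3 rows (items) by 64 columns (dimensions)
```

A function with this shape could then be supplied in place of my_embedder in the call above, assuming the package requires nothing beyond the matrix contract stated in the comment.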