Basic usage

Doug Friedman

2022-07-16

Introduction

There are two ways to use the topic model diagnostics included in topicdoc. You can calculate all of them at once using topic_diagnostics, or you can call the other functions to calculate each diagnostic individually.

The only prerequisites for using topicdoc are that your topic model is fit using the topicmodels package and that your document-term matrix (DTM) can be coerced to a sparse matrix by the slam package. This includes DTMs created through popular text mining packages like tm and quanteda.

Example

For this example, the Associated Press dataset from topicmodels is used. It contains a DTM created from a series of AP articles from 1988.

library(topicdoc)
library(topicmodels)

data("AssociatedPress")

lda_ap4 <- LDA(AssociatedPress,
               control = list(seed = 33), k = 4)

# See the top 10 terms associated with each of the topics
terms(lda_ap4, 10)
#>       Topic 1  Topic 2   Topic 3     Topic 4     
#>  [1,] "i"      "percent" "bush"      "soviet"    
#>  [2,] "people" "million" "i"         "government"
#>  [3,] "two"    "year"    "president" "united"    
#>  [4,] "police" "billion" "court"     "president" 
#>  [5,] "years"  "new"     "federal"   "people"    
#>  [6,] "new"    "market"  "new"       "police"    
#>  [7,] "city"   "company" "house"     "military"  
#>  [8,] "time"   "prices"  "state"     "states"    
#>  [9,] "three"  "stock"   "dukakis"   "party"     
#> [10,] "like"   "last"    "campaign"  "two"

Here’s how you would run all the diagnostics at once.

topic_diagnostics(lda_ap4, AssociatedPress)
#>   topic_num topic_size mean_token_length dist_from_corpus tf_df_dist
#> 1         1   3476.377               4.1        0.3899012   24.08191
#> 2         2   1910.153               5.6        0.5044673   26.67523
#> 3         3   2504.622               5.4        0.3830014   26.46131
#> 4         4   2581.848               6.5        0.3988826   25.52163
#>   doc_prominence topic_coherence topic_exclusivity
#> 1           1053       -81.83339          7.813034
#> 2            598       -79.50691          9.560433
#> 3            783      -106.40062          9.162590
#> 4            775       -84.46149          9.058854

Here’s how you would run a few of them individually.

topic_size(lda_ap4)
#> [1] 3476.377 1910.153 2504.622 2581.848
mean_token_length(lda_ap4)
#> [1] 4.1 5.6 5.4 6.5
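
The remaining diagnostics can be called the same way. The sketch below assumes that dist_from_corpus, tf_df_dist and topic_coherence take the DTM as a second argument (they compare the model against the corpus), while doc_prominence and topic_exclusivity need only the fitted model and are called here with their default settings. Output is omitted; each call should return one value per topic.

```r
library(topicdoc)
library(topicmodels)

data("AssociatedPress")

# Reuse the model fitted above; refit only if it is missing from the session
if (!exists("lda_ap4")) {
  lda_ap4 <- LDA(AssociatedPress,
                 control = list(seed = 33), k = 4)
}

# Diagnostics that compare the model against the corpus (DTM required)
corpus_dist <- dist_from_corpus(lda_ap4, AssociatedPress)
tf_df       <- tf_df_dist(lda_ap4, AssociatedPress)
coherence   <- topic_coherence(lda_ap4, AssociatedPress)

# Diagnostics computed from the fitted model alone
prominence  <- doc_prominence(lda_ap4)
exclusivity <- topic_exclusivity(lda_ap4)
```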

Diagnostics Included

A full list of the included diagnostics is provided below.

Diagnostic/Metric	Function	Description
topic size	`topic_size`	Total (weighted) number of tokens per topic
mean token length	`mean_token_length`	Average number of characters for the top tokens per topic
distance from corpus distribution	`dist_from_corpus`	Distance of a topic’s token distribution from the overall corpus token distribution
distance between token and document frequencies	`tf_df_dist`	Distance between a topic’s token and document distributions
document prominence	`doc_prominence`	Number of unique documents where a topic appears
topic coherence	`topic_coherence`	Measure of how often the top tokens in each topic appear together in the same document
topic exclusivity	`topic_exclusivity`	Measure of how unique the top tokens in each topic are compared to the other topics
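
To make the mean token length description concrete, the short base R check below recomputes it by hand from the top 10 terms printed in the example. It assumes the diagnostic is simply the mean of nchar() over each topic's top terms; this is an illustration of the definition, not the package's own implementation. The top_terms list is copied from the terms(lda_ap4, 10) output.

```r
# Top 10 terms per topic, copied from the terms(lda_ap4, 10) output
top_terms <- list(
  topic_1 = c("i", "people", "two", "police", "years",
              "new", "city", "time", "three", "like"),
  topic_2 = c("percent", "million", "year", "billion", "new",
              "market", "company", "prices", "stock", "last"),
  topic_3 = c("bush", "i", "president", "court", "federal",
              "new", "house", "state", "dukakis", "campaign"),
  topic_4 = c("soviet", "government", "united", "president", "people",
              "police", "military", "states", "party", "two")
)

# Average number of characters across each topic's top terms
manual_mean_length <- unname(
  vapply(top_terms, function(x) mean(nchar(x)), numeric(1))
)
manual_mean_length
#> [1] 4.1 5.6 5.4 6.5
```

These values match the mean_token_length(lda_ap4) output shown in the example.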
