Getting Started with SportMiner


Praveen D Chougale and Usha Ananthakumar

2026-01-12

Introduction

SportMiner is a comprehensive R package for mining, analyzing, and visualizing scientific literature in sport science domains. It provides an end-to-end workflow for:

Retrieving abstracts from the Scopus database
Preprocessing and cleaning text data
Performing advanced topic modeling (LDA, STM, CTM)
Creating publication-ready visualizations
Analyzing keyword co-occurrence networks

This vignette demonstrates the core functionality of SportMiner through a practical example.

Installation

install.packages("SportMiner")

Setting Up Your Scopus API Key

Before using SportMiner, you need a Scopus API key. You can obtain one by registering at the Elsevier Developer Portal.

library(SportMiner)

# Option 1: Set directly
sm_set_api_key("your_api_key_here")

# Option 2: Set via environment variable (recommended)
# Add to your .Renviron file:
# SCOPUS_API_KEY=your_api_key_here
# Then restart R and run:
sm_set_api_key()
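
Before relying on Option 2, it can help to confirm that the environment variable is actually visible to the R session. The following is a minimal base R sketch; get_scopus_key() is a hypothetical helper and not part of SportMiner, and the demo uses a throwaway variable name so it does not interfere with a real key.

```r
# Hypothetical helper (not part of SportMiner): read the key from the
# environment and fail early with a clear message if it is missing.
get_scopus_key <- function(var = "SCOPUS_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("Environment variable ", var, " is not set. ",
         "Add it to .Renviron and restart R.", call. = FALSE)
  }
  invisible(key)
}

# Demonstration with a throwaway variable and a placeholder value
Sys.setenv(DEMO_SCOPUS_KEY = "placeholder")
key_is_set <- nzchar(get_scopus_key("DEMO_SCOPUS_KEY"))
```

Calling get_scopus_key() with the default argument checks SCOPUS_API_KEY itself and stops with an error if it is missing.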

Step 1: Retrieve Papers from Scopus

Let’s search for journal articles published after 2010 that mention talent identification, sport science, or athletes together with principal component analysis or cluster analysis.

# Define the search query
query <- paste0(
  'TITLE-ABS-KEY(',
  '("talent identification" OR "sport science" OR "athlete") ',
  'AND ',
  '("principal component analysis" OR "PCA" OR "cluster analysis") ',
  ') AND DOCTYPE(ar) AND PUBYEAR > 2010'
)

# Retrieve papers
papers <- sm_search_scopus(
  query = query,
  max_count = 100,
  verbose = TRUE
)

# View the data structure
head(papers[, c("title", "year", "author_keywords")])
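
When the term lists change often, hand-written quoting becomes error-prone. As a rough sketch, the same kind of query can be assembled from character vectors; or_group() is a hypothetical helper, not a SportMiner function, and the result may differ from the query above in whitespace only.

```r
# Hypothetical helper (not part of SportMiner): quote each term and
# join the terms with OR inside one pair of parentheses
or_group <- function(terms) {
  paste0("(", paste0('"', terms, '"', collapse = " OR "), ")")
}

topic_terms  <- c("talent identification", "sport science", "athlete")
method_terms <- c("principal component analysis", "PCA", "cluster analysis")

query_sketch <- paste0(
  "TITLE-ABS-KEY(",
  or_group(topic_terms), " AND ", or_group(method_terms),
  ") AND DOCTYPE(ar) AND PUBYEAR > 2010"
)
print(query_sketch)
```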

Step 2: Preprocess Text Data

Convert the raw abstracts into a clean, stemmed word count format.

# Preprocess abstracts
processed_data <- sm_preprocess_text(
  data = papers,
  text_col = "abstract",
  min_word_length = 3
)

# View the processed data
head(processed_data)
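
To make the idea of a per-document word count concrete, here is a simplified base R sketch on two invented abstracts. It is not how sm_preprocess_text() is implemented: it skips stemming, uses a tiny assumed stop-word list, and the column names doc_id, word, and n are chosen for illustration only.

```r
# Toy abstracts (invented for illustration)
abstracts <- c(
  doc_1 = "Cluster analysis of elite youth athletes.",
  doc_2 = "Principal component analysis of athlete performance data."
)

# Assumed, minimal stop-word list; real pipelines use a fuller one
stop_words <- c("of", "the", "and", "for")

count_words <- function(text, min_word_length = 3) {
  # Lowercase, turn every run of non-letters into a space, then split
  words <- strsplit(gsub("[^a-z]+", " ", tolower(text)), " ", fixed = TRUE)[[1]]
  # Drop short words (including empty strings) and stop words
  words <- words[nchar(words) >= min_word_length & !words %in% stop_words]
  as.data.frame(table(word = words), responseName = "n",
                stringsAsFactors = FALSE)
}

# One row per document and word
word_counts <- do.call(rbind, lapply(names(abstracts), function(id) {
  cbind(doc_id = id, count_words(abstracts[[id]]), stringsAsFactors = FALSE)
}))
rownames(word_counts) <- NULL
print(word_counts)
```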

Step 3: Create Document-Term Matrix

Transform the word counts into a sparse matrix suitable for topic modeling.

# Create DTM
dtm <- sm_create_dtm(
  word_counts = processed_data,
  min_term_freq = 3,
  max_term_freq = 0.5
)

# Check dimensions
print(paste("Documents:", nrow(dtm), "| Terms:", ncol(dtm)))
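
The effect of the two frequency thresholds can be illustrated on a toy matrix. This sketch assumes that min_term_freq is a minimum number of documents containing a term and max_term_freq a maximum proportion of documents; the package documentation should be checked for the exact semantics used by sm_create_dtm(). All data and variable names are invented.

```r
# Toy word counts in long format (invented for illustration)
word_counts <- data.frame(
  doc_id = c("d1", "d1", "d2", "d2", "d3", "d3", "d4"),
  word   = c("athlet", "cluster", "athlet", "pca", "athlet", "cluster", "talent"),
  n      = c(2, 1, 1, 3, 1, 2, 1)
)

# Dense document-by-term count matrix (a real DTM would be stored sparsely)
docs  <- unique(word_counts$doc_id)
terms <- sort(unique(word_counts$word))
dtm_toy <- matrix(0, nrow = length(docs), ncol = length(terms),
                  dimnames = list(docs, terms))
dtm_toy[cbind(word_counts$doc_id, word_counts$word)] <- word_counts$n

# Assumed filter semantics: keep terms found in at least min_docs documents
# and in at most max_prop of all documents
doc_freq <- colSums(dtm_toy > 0)
min_docs <- 2
max_prop <- 0.5
keep <- doc_freq >= min_docs & doc_freq / nrow(dtm_toy) <= max_prop
dtm_filtered <- dtm_toy[, keep, drop = FALSE]
print(dtm_filtered)
```

Here "athlet" is dropped for being too common (3 of 4 documents), "pca" and "talent" for being too rare, and only "cluster" survives.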

Step 4: Select Optimal Number of Topics

Use coherence-based selection to find the best number of topics.

# Test different values of k
k_selection <- sm_select_optimal_k(
  dtm = dtm,
  k_range = seq(4, 16, by = 2),
  method = "gibbs",
  plot = TRUE
)

# View results
print(k_selection$results)
print(paste("Optimal k:", k_selection$optimal_k))
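
For intuition about what a coherence score measures, the sketch below computes UMass coherence for a topic's top terms on a toy matrix. Which coherence variant sm_select_optimal_k() uses is not stated here, so this is an illustration of one common definition rather than the package's method; the data and the function umass_coherence() are invented.

```r
# Toy document-term incidence (invented): rows are documents
dtm_toy <- matrix(
  c(1, 1, 0, 0,
    1, 1, 0, 0,
    1, 0, 1, 0,
    0, 0, 1, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("d", 1:4), c("athlet", "train", "injuri", "knee"))
)

# UMass coherence of a topic's top terms, ordered most to least probable:
# sum over pairs (i > j) of log((D(w_i, w_j) + 1) / D(w_j)),
# where D() counts the documents containing the given term(s)
umass_coherence <- function(top_terms, dtm) {
  present <- (dtm[, top_terms, drop = FALSE] > 0) * 1
  co_doc <- crossprod(present)  # co-document counts; diagonal holds D(w)
  score <- 0
  for (i in seq_along(top_terms)[-1]) {
    for (j in seq_len(i - 1)) {
      score <- score + log((co_doc[i, j] + 1) / co_doc[j, j])
    }
  }
  score
}

coherent_topic   <- umass_coherence(c("athlet", "train"), dtm_toy)
incoherent_topic <- umass_coherence(c("athlet", "knee"), dtm_toy)
```

Terms that tend to appear in the same documents score closer to zero, while terms that never co-occur score more negatively. A coherence-based selection would typically average such scores over all topics for each candidate k and prefer the k with the highest average.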

Step 5: Train Topic Model

Fit an LDA model using the optimal k.

# Train the model
lda_model <- sm_train_lda(
  dtm = dtm,
  k = k_selection$optimal_k,
  method = "gibbs",
  iter = 500,
  seed = 1729  # fix the seed so the fitted topics are reproducible
)

Step 6: Visualize Topics

Topic Frequency Distribution

# Plot document distribution
sm_plot_topic_frequency(
  model = lda_model,
  dtm = dtm
)

Topic Trends Over Time

# Add doc_id to papers for joining
papers$doc_id <- paste0("doc_", seq_len(nrow(papers)))

# Plot trends
sm_plot_topic_trends(
  model = lda_model,
  dtm = dtm,
  metadata = papers,
  doc_id_col = "doc_id"
)
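
The doc_id column matters because topic assignments have to be matched back to publication years. A simplified base R sketch of that join follows, using invented data; it assumes the model's document IDs follow the doc_<row number> pattern used above, and it is not the internal logic of sm_plot_topic_trends().

```r
# Toy metadata and per-document dominant topics (all values invented)
papers_toy <- data.frame(
  doc_id = paste0("doc_", 1:6),
  year   = c(2019, 2019, 2020, 2020, 2020, 2021)
)
doc_topics <- data.frame(
  doc_id = paste0("doc_", c(1, 2, 3, 4, 6)),  # doc_5 has no topic assignment
  topic  = c(1, 2, 1, 1, 2)
)

# Inner join on doc_id, then count documents per year and topic
joined <- merge(doc_topics, papers_toy, by = "doc_id")
trend <- as.data.frame(table(year = joined$year, topic = joined$topic),
                       responseName = "n_docs")
print(trend)
```

Documents without a topic assignment (doc_5 here) simply drop out of the inner join, so it is worth comparing nrow(joined) with the number of modeled documents.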

Step 7: Keyword Co-occurrence Network

Visualize how author keywords co-occur across papers.

# Create network
network_plot <- sm_keyword_network(
  data = papers,
  keyword_col = "author_keywords",
  min_cooccurrence = 2,
  top_n = 30
)

print(network_plot)
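
The edge weights in such a network are co-occurrence counts. The sketch below shows one way to compute them in base R on invented keyword strings; the semicolon delimiter and all variable names are assumptions, and sm_keyword_network() may handle parsing and filtering differently.

```r
# Toy author keywords (invented); a semicolon delimiter is assumed
keywords <- c(
  "talent identification; PCA; youth",
  "PCA; cluster analysis",
  "talent identification; PCA",
  "youth; cluster analysis"
)

# Split, trim, and lowercase the keywords of each paper
kw_list <- lapply(strsplit(keywords, ";", fixed = TRUE),
                  function(k) unique(tolower(trimws(k))))

# Paper-by-keyword incidence matrix
vocab <- sort(unique(unlist(kw_list)))
incidence <- t(vapply(kw_list, function(k) as.numeric(vocab %in% k),
                      numeric(length(vocab))))
colnames(incidence) <- vocab

# Keyword-by-keyword co-occurrence counts; keep pairs seen at least twice
cooc <- crossprod(incidence)
diag(cooc) <- 0
pairs <- which(cooc >= 2 & upper.tri(cooc), arr.ind = TRUE)
edges <- data.frame(
  from   = vocab[pairs[, "row"]],
  to     = vocab[pairs[, "col"]],
  weight = cooc[pairs]
)
print(edges)
```

With a threshold of 2, only the pair "pca" and "talent identification" remains, which mirrors the role of min_cooccurrence = 2 in the call above.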

Advanced: Compare Multiple Models

Compare LDA, STM, and CTM to find the best-performing model.

# Run comparison
comparison <- sm_compare_models(
  dtm = dtm,
  k = 10,
  seed = 1729,
  verbose = TRUE
)

# View metrics
print(comparison$metrics)

# Get recommendation
print(paste("Recommended model:", comparison$recommendation))

# Use the recommended model
best_model <- comparison$models[[tolower(comparison$recommendation)]]

Customizing Visualizations

All plotting functions use the custom theme_sportminer() theme, but you can customize further.

library(ggplot2)

# Create a plot with custom theme settings
p <- sm_plot_topic_frequency(lda_model, dtm)

# Add customizations
p +
  labs(
    title = "Distribution of Research Topics in Sport Science",
    subtitle = "Based on up to 100 Scopus papers published after 2010"
  ) +
  theme_sportminer(base_size = 14, grid = FALSE)

Best Practices

API Rate Limits: Scopus enforces rate limits. Keep max_count as small as your analysis allows and pause between large queries.
Reproducibility: Always set seeds when running topic models:
```
sm_train_lda(dtm, k = 10, seed = 1729)
```
Hyperparameter Tuning: Experiment with min_term_freq and max_term_freq in sm_create_dtm() to balance vocabulary size and model performance.
Model Selection: Don’t rely solely on coherence. Inspect the top terms for each topic to ensure interpretability.
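
One simple way to space out several queries is to sleep between calls. The helper below is a hypothetical base R sketch, not part of SportMiner, and whether the package already throttles requests internally is not covered here; the stand-in fetcher only measures string length so the example runs without an API key.

```r
# Hypothetical helper (not part of SportMiner): call fetch_fun on each
# query and sleep between calls to stay under a rate limit
fetch_with_delay <- function(queries, fetch_fun, delay_sec = 1) {
  results <- vector("list", length(queries))
  for (i in seq_along(queries)) {
    results[[i]] <- fetch_fun(queries[[i]])
    if (i < length(queries)) Sys.sleep(delay_sec)  # no sleep after the last call
  }
  results
}

# Stand-in fetcher for demonstration; in practice this could wrap
# sm_search_scopus(query = q, max_count = 100)
fake_fetch <- function(q) nchar(q)
out <- fetch_with_delay(c("TITLE(athlete)", "TITLE(coach)"), fake_fetch,
                        delay_sec = 0.1)
```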

Next Steps

Explore the package documentation for detailed function reference
Experiment with different preprocessing and modeling parameters
Contact the maintainer for bug reports and feature requests

Citation

If you use SportMiner in your research, please cite:

citation("SportMiner")

References

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent dirichlet allocation. Journal of Machine Learning Research, 3, 993-1022.
Roberts, M. E., Stewart, B. M., & Tingley, D. (2019). stm: An R package for structural topic models. Journal of Statistical Software, 91(2), 1-40.

