Getting Started with SportMiner

Praveen D Chougale and Usha Ananthakumar

2026-01-12

Introduction

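SportMiner is a comprehensive R package for mining, analyzing, and visualizing scientific literature in sport science domains. It provides an end-to-end workflow, from retrieving papers from Scopus through text preprocessing, topic modeling, and visualization to keyword co-occurrence networks.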

This vignette demonstrates the core functionality of SportMiner through a practical example.

Installation

install.packages("SportMiner")

Setting Up Your Scopus API Key

Before using SportMiner, you need a Scopus API key. You can obtain one by registering at the Elsevier Developer Portal.

library(SportMiner)

# Option 1: Set directly
sm_set_api_key("your_api_key_here")

# Option 2: Set via environment variable (recommended)
# Add to your .Renviron file:
# SCOPUS_API_KEY=your_api_key_here
# Then restart R and run:
sm_set_api_key()
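
Before calling sm_set_api_key() without arguments, it can help to confirm that the variable from .Renviron is actually visible to the R session. The following is a minimal base R sketch; has_scopus_key() is a hypothetical helper written for this illustration and is not part of SportMiner.

```r
# Hypothetical helper (not a SportMiner function): TRUE if the environment
# variable is set to a non-empty value in the current session.
has_scopus_key <- function(var = "SCOPUS_API_KEY") {
  nzchar(Sys.getenv(var, unset = ""))
}

if (!has_scopus_key()) {
  message("SCOPUS_API_KEY is not set; add it to .Renviron and restart R.")
}
```

Keeping the key in .Renviron rather than in a script also keeps it out of version control.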

Step 1: Retrieve Papers from Scopus

Let’s search for journal articles published after 2010 whose title, abstract, or keywords mention talent identification, sport science, or athletes together with principal component analysis or cluster analysis.

# Define the search query
query <- paste0(
  'TITLE-ABS-KEY(',
  '("talent identification" OR "sport science" OR "athlete") ',
  'AND ',
  '("principal component analysis" OR "PCA" OR "cluster analysis") ',
  ') AND DOCTYPE(ar) AND PUBYEAR > 2010'
)

# Retrieve papers
papers <- sm_search_scopus(
  query = query,
  max_count = 100,
  verbose = TRUE
)

# View the data structure
head(papers[, c("title", "year", "author_keywords")])
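
When the term lists change often, the query string can be assembled from character vectors instead of being typed by hand. This is a sketch in base R; or_group() is a hypothetical helper, not a SportMiner function, and the field codes are the ones used in the query above.

```r
# Hypothetical helper: quote each term and join one concept group with OR.
or_group <- function(terms) {
  paste0("(", paste0('"', terms, '"', collapse = " OR "), ")")
}

topic_terms  <- c("talent identification", "sport science", "athlete")
method_terms <- c("principal component analysis", "PCA", "cluster analysis")

# Same structure as the hand-written query: two OR groups joined by AND
query_built <- paste0(
  "TITLE-ABS-KEY(",
  or_group(topic_terms), " AND ", or_group(method_terms),
  ") AND DOCTYPE(ar) AND PUBYEAR > 2010"
)
print(query_built)
```

The resulting string can be passed to sm_search_scopus() through its query argument in the same way as the hand-written one.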

Step 2: Preprocess Text Data

Convert the raw abstracts into a clean, stemmed word count format.

# Preprocess abstracts
processed_data <- sm_preprocess_text(
  data = papers,
  text_col = "abstract",
  min_word_length = 3
)

# View the processed data
head(processed_data)
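
To make the idea of a word count format concrete, here is a rough base R sketch of the same kind of transformation. It is not SportMiner's implementation: count_words() is a hypothetical helper, the output column names are assumptions, and stemming and stop-word removal are left out to avoid extra dependencies.

```r
# Hypothetical helper: lowercase, keep letters only, drop short tokens,
# and return one row per (document, word) pair with its count.
count_words <- function(texts, min_word_length = 3) {
  per_doc <- lapply(seq_along(texts), function(i) {
    clean  <- gsub("[^a-z]+", " ", tolower(texts[i]))
    tokens <- strsplit(trimws(clean), " +")[[1]]
    tokens <- tokens[nchar(tokens) >= min_word_length]
    if (length(tokens) == 0) return(NULL)
    counts <- table(tokens)
    data.frame(doc_id = i, word = names(counts), n = as.integer(counts),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, per_doc)
}

abstracts <- c(
  "Talent identification in youth soccer: a cluster analysis.",
  "PCA of athlete load data; PCA reduces load variables."
)
word_counts <- count_words(abstracts)
print(word_counts)
```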

Step 3: Create Document-Term Matrix

Transform the word counts into a sparse matrix suitable for topic modeling.

# Create DTM
dtm <- sm_create_dtm(
  word_counts = processed_data,
  min_term_freq = 3,
  max_term_freq = 0.5
)

# Check dimensions
print(paste("Documents:", dtm$nrow, "| Terms:", dtm$ncol))
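
The effect of the two frequency thresholds can be illustrated with a small dense matrix in base R. This sketch assumes that min_term_freq is a minimum total count and max_term_freq a maximum share of documents containing the term; the vignette does not spell out these definitions, so check the sm_create_dtm() help page. build_dtm() and its input columns are hypothetical.

```r
# Hypothetical helper: dense document-term matrix from (doc_id, word, n) rows,
# filtered by total term count and by the share of documents containing a term.
build_dtm <- function(word_counts, min_term_freq = 1, max_term_freq = 1) {
  docs  <- sort(unique(word_counts$doc_id))
  terms <- sort(unique(word_counts$word))
  m <- matrix(0L, nrow = length(docs), ncol = length(terms),
              dimnames = list(as.character(docs), terms))
  idx <- cbind(match(word_counts$doc_id, docs), match(word_counts$word, terms))
  m[idx] <- word_counts$n
  total_count <- colSums(m)
  doc_share   <- colSums(m > 0) / nrow(m)
  m[, total_count >= min_term_freq & doc_share <= max_term_freq, drop = FALSE]
}

toy_counts <- data.frame(
  doc_id = c(1, 1, 2, 2, 3, 3, 4),
  word   = c("athlete", "cluster", "athlete", "pca", "athlete", "pca", "sprint"),
  n      = c(2L, 1L, 1L, 3L, 1L, 1L, 1L)
)
dtm_small <- build_dtm(toy_counts, min_term_freq = 2, max_term_freq = 0.5)
print(dtm_small)
```

In this toy data "athlete" is dropped because it appears in three of four documents, while "cluster" and "sprint" are dropped because they occur only once. A real corpus would normally use a sparse matrix rather than a dense one.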

Step 4: Select Optimal Number of Topics

Use coherence-based selection to find the best number of topics.

# Test different values of k
k_selection <- sm_select_optimal_k(
  dtm = dtm,
  k_range = seq(4, 16, by = 2),
  method = "gibbs",
  plot = TRUE
)

# View results
print(k_selection$results)
print(paste("Optimal k:", k_selection$optimal_k))
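
The vignette does not state which coherence measure sm_select_optimal_k() uses. As an illustration of what a coherence score captures, the sketch below computes the UMass coherence of a topic's top terms from document co-occurrence counts in base R. umass_coherence() is a hypothetical helper, and SportMiner's metric may differ.

```r
# Hypothetical helper: UMass coherence for an ordered vector of top terms.
# Sums log((D(w_m, w_l) + 1) / D(w_l)) over all pairs with l < m, where D()
# counts the documents containing the term(s). Every term must occur in dtm.
umass_coherence <- function(dtm, top_terms) {
  present <- (dtm[, top_terms, drop = FALSE] > 0) * 1  # 0/1 incidence matrix
  co_docs <- crossprod(present)                        # co-document counts
  score <- 0
  for (m in seq_along(top_terms)[-1]) {
    for (l in seq_len(m - 1)) {
      score <- score + log((co_docs[m, l] + 1) / co_docs[l, l])
    }
  }
  score
}

dtm_toy <- matrix(
  c(1, 1, 0,
    1, 1, 0,
    1, 0, 1,
    0, 0, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(NULL, c("athlete", "training", "pca"))
)
umass_coherence(dtm_toy, c("athlete", "training"))  # terms often share documents
umass_coherence(dtm_toy, c("athlete", "pca"))       # terms rarely share documents
```

Scores closer to zero indicate top terms that tend to appear in the same documents, which is why the first pair scores higher than the second.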

Step 5: Train Topic Model

Fit an LDA model using the optimal k.

# Train the model
lda_model <- sm_train_lda(
  dtm = dtm,
  k = k_selection$optimal_k,
  method = "gibbs",
  iter = 500,
  seed = 1729  # fix the seed so the fitted topics are reproducible
)

Step 6: Visualize Topics

Top Terms per Topic

# Plot top terms
sm_plot_topic_terms(
  model = lda_model,
  n_terms = 10
)

Topic Frequency Distribution

# Plot document distribution
sm_plot_topic_frequency(
  model = lda_model,
  dtm = dtm
)

Step 7: Keyword Co-occurrence Network

Visualize how author keywords co-occur across papers.

# Create network
network_plot <- sm_keyword_network(
  data = papers,
  keyword_col = "author_keywords",
  min_cooccurrence = 2,
  top_n = 30
)

print(network_plot)
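
The counts behind such a network can be sketched in base R: each paper contributes one co-occurrence for every pair of distinct keywords it lists. cooccurrence_edges() is a hypothetical helper, and the ";" separator is an assumption, since the format of the author_keywords column is not described here.

```r
# Hypothetical helper: weighted edge list of keyword pairs that appear
# together in at least `min_cooccurrence` papers.
cooccurrence_edges <- function(keywords, sep = ";", min_cooccurrence = 1) {
  pairs <- lapply(strsplit(tolower(keywords), sep, fixed = TRUE), function(k) {
    k <- sort(unique(trimws(k)))
    k <- k[nzchar(k)]
    if (length(k) < 2) return(NULL)
    t(combn(k, 2))  # one row per unordered keyword pair in this paper
  })
  pairs <- do.call(rbind, pairs)
  if (is.null(pairs)) {
    return(data.frame(from = character(), to = character(), weight = integer()))
  }
  edges <- as.data.frame(pairs, stringsAsFactors = FALSE)
  names(edges) <- c("from", "to")
  edges$weight <- 1L
  edges <- aggregate(weight ~ from + to, data = edges, FUN = sum)
  edges <- edges[edges$weight >= min_cooccurrence, ]
  edges[order(-edges$weight, edges$from, edges$to), ]
}

author_keywords <- c(
  "talent identification; PCA; soccer",
  "PCA; soccer",
  "talent identification; soccer",
  "cluster analysis"
)
edges <- cooccurrence_edges(author_keywords, min_cooccurrence = 2)
print(edges)
```

Raising min_cooccurrence prunes rare pairs, which is the same trade-off the min_cooccurrence argument controls in the call above.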

Advanced: Compare Multiple Models

Compare LDA, STM, and CTM to find the best-performing model.

# Run comparison
comparison <- sm_compare_models(
  dtm = dtm,
  k = 10,
  seed = 1729,
  verbose = TRUE
)

# View metrics
print(comparison$metrics)

# Get recommendation
print(paste("Recommended model:", comparison$recommendation))

# Use the recommended model
best_model <- comparison$models[[tolower(comparison$recommendation)]]

Customizing Visualizations

All plotting functions use the custom theme_sportminer() theme, but you can customize further.

library(ggplot2)

# Create a plot with custom theme settings
p <- sm_plot_topic_frequency(lda_model, dtm)

# Add customizations
p +
  labs(
    title = "Distribution of Research Topics in Sport Science",
    subtitle = "Based on 100 papers from Scopus (published after 2010)"
  ) +
  theme_sportminer(base_size = 14, grid = FALSE)

Best Practices

  1. API Rate Limits: The Scopus API enforces rate limits. Keep max_count as low as your analysis allows and pause between large queries.

  2. Reproducibility: Always set a seed when running topic models:

    sm_train_lda(dtm, k = 10, seed = 1729)

  3. Hyperparameter Tuning: Experiment with min_term_freq and max_term_freq in sm_create_dtm() to balance vocabulary size and model performance.

  4. Model Selection: Don’t rely solely on coherence. Inspect the top terms for each topic to ensure interpretability.
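
For the rate-limit advice in item 1, one simple pattern is to run queries in a loop and pause between requests. The sketch below uses only base R; run_queries_with_delay() and fake_fetch() are hypothetical names, and the one-second default is an arbitrary placeholder rather than a documented Scopus limit.

```r
# Hypothetical helper: call fetch_fun once per query, sleeping between calls.
run_queries_with_delay <- function(queries, fetch_fun, delay_sec = 1) {
  results <- vector("list", length(queries))
  for (i in seq_along(queries)) {
    results[i] <- list(fetch_fun(queries[[i]]))
    if (i < length(queries)) Sys.sleep(delay_sec)  # no pause after the last one
  }
  names(results) <- names(queries)
  results
}

# Stand-in fetcher so the sketch runs without an API key
fake_fetch <- function(q) data.frame(query = q, stringsAsFactors = FALSE)

out <- run_queries_with_delay(
  c(pca = "TITLE-ABS-KEY(PCA)", lda = "TITLE-ABS-KEY(LDA)"),
  fake_fetch,
  delay_sec = 0.1
)
names(out)
```

In real use, fetch_fun could be a small wrapper such as function(q) sm_search_scopus(query = q, max_count = 100).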

Citation

If you use SportMiner in your research, please cite:

citation("SportMiner")
