
Introduction to quickSentiment

— 1. SETUP: LOAD LIBRARIES —

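library(quickSentiment)  # provides pre_process(), pipeline(), evaluate_performance(), predict_sentiment()
library(doParallel)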

## Loading required package: foreach

## Loading required package: iterators

## Loading required package: parallel

# CRAN limits the number of cores used during package checks.
# detectCores() can return NA, so drop NA before taking the minimum.
cores <- min(2, parallel::detectCores(), na.rm = TRUE)
registerDoParallel(cores = cores)

— 2. LOAD AND PREPARE TRAINING DATA —

# Look for the file in the installed package first
csv_path <- system.file("extdata", "tweets.csv", package = "quickSentiment")

# Fallback for when you are building the package locally
if (csv_path == "") {
  csv_path <- "../inst/extdata/tweets.csv"
}
tweets <- read.csv(csv_path)
set.seed(123)

— 3. PREPROCESS THE TEXT —

Use the pre_process() function from our package to clean the raw text. This step runs outside the main pipeline, so you can reuse the same cleaned text across multiple models or analyses.

tweets$cleaned_text <- pre_process(tweets$Tweet)

## quickSentiment: Retaining negation words (e.g., 'not', 'no', 'never') to preserve sentiment polarity. To apply the strict stopword list instead, set `retain_negations = FALSE`. View qs_negations for more

# Label each tweet: "P" (positive) if Avg is above 0, otherwise "N" (negative)
tweets$sentiment <- ifelse(tweets$Avg > 0, "P", "N")
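Before training, it can help to check how balanced the derived labels are, because a skewed split sets the accuracy that a trivial majority-class classifier would already reach. The following is a minimal sketch on a toy data frame; the `Avg` values and the names `toy`, `class_counts`, `class_share`, and `majority_baseline` are made up for illustration and are not taken from tweets.csv.

```r
# Toy stand-in for the tweets data; these Avg scores are hypothetical.
toy <- data.frame(Avg = c(-1.2, 0.4, 0, -0.3, 2.1, -0.8, -0.1, 0.9))

# Same rule as in the vignette: strictly positive Avg -> "P", otherwise "N".
# Note that a score of exactly 0 ends up in the "N" class.
toy$sentiment <- ifelse(toy$Avg > 0, "P", "N")

# Class counts and proportions.
class_counts <- table(toy$sentiment)
class_share  <- prop.table(class_counts)
print(class_counts)
print(round(class_share, 3))

# Accuracy obtained by always predicting the majority class.
majority_baseline <- max(class_share)
print(majority_baseline)
```

Running the same three lines on `tweets$sentiment` would show the class balance of the real data.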

— 4. RUN THE MAIN TRAINING PIPELINE —

This is the core of the package. We call the main pipeline() function to handle the train/test split, vectorization, model training, and evaluation.

result <- pipeline(
  # --- Define the vectorization method ---
  # Options: "bow" (raw counts), "tf" (term frequency), "tfidf", "binary"
  vect_method = "tf",
  
  # --- Define the model to train ---
  # Options: "logit", "rf", "xgb", "nb"
  model_name = "rf",
  
  # --- Specify the data and column names ---
  text_vector = tweets$cleaned_text,      # The column with our preprocessed text
  sentiment_vector = tweets$sentiment,    # The column with the target variable
  
  # --- Set vectorization options ---
  # Use n_gram = 2 for unigrams + bigrams, or 1 for just unigrams
  n_gram = 1,
  parallel = cores
)
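To make the vect_method options more concrete, here is a base R sketch of what raw counts, binary indicators, term frequency, and TF-IDF typically look like on a tiny corpus. It uses common textbook definitions; the exact normalisation and weighting inside quickSentiment are not shown in this vignette and may differ. The corpus and all variable names are hypothetical.

```r
# Toy corpus; the documents are made up for illustration.
docs <- c("good good movie", "bad movie", "not good")

# Tokenise on whitespace and build the vocabulary.
tokens <- strsplit(docs, "\\s+")
vocab  <- sort(unique(unlist(tokens)))

# "bow": raw term counts, one row per document, one column per term.
bow <- t(vapply(tokens,
                function(tk) as.numeric(table(factor(tk, levels = vocab))),
                numeric(length(vocab))))
colnames(bow) <- vocab

# "binary": presence (1) or absence (0) of each term.
binary <- (bow > 0) * 1

# "tf": counts divided by document length (one common definition).
tf <- bow / rowSums(bow)

# "tfidf": tf weighted by log(N / document frequency) (one common variant).
idf   <- log(nrow(bow) / colSums(bow > 0))
tfidf <- sweep(tf, 2, idf, `*`)

print(bow)
print(round(tf, 3))
```

Under these definitions each row of `tf` sums to 1, so long and short documents are put on the same scale, which raw counts do not do.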

## --- Running Pipeline: TERM_FREQUENCY + RANDOM_FOREST ---

## Data split: 944 training elements, 237 test elements.

## Vectorizing with TERM_FREQUENCY (ngram=1)...

##   - Fitting BoW model (term_frequency) on training data...

##   - Applying BoW transformation (term_frequency) to new data...

## 
## --- Training Random Forest Model (ranger) ---

## --- Random Forest complete. Returning results. ---

## 
## ======================================================

##  --- quickSentiment Pipeline Complete ---

##  Model Type: RANDOM_FOREST

##  Vectorizer: TERM_FREQUENCY (ngram=1)

##  Test Set Size: 237 rows

##  Accuracy of 75.53% under baseline threshold.

## ======================================================
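The log above reports 944 training and 237 test elements, 1,181 in total, which is consistent with an 80/20 split. How pipeline() draws that split internally is not shown here; the following is only a base R sketch of a simple random 80/20 split, with hypothetical names `n`, `train_idx`, and `test_idx`.

```r
set.seed(123)                       # make the random split reproducible
n <- 944 + 237                      # total number of rows, from the log above

# Draw 80% of the row indices for training; the remainder is the test set.
train_idx <- sample(n, size = floor(0.8 * n))
test_idx  <- setdiff(seq_len(n), train_idx)

print(length(train_idx))
print(length(test_idx))
```

A plain random split like this does not guarantee that both sets have the same share of "P" and "N"; a stratified split would sample within each class instead.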

— 5. EVALUATE THE RESULTS —

Get the ROC AUC, the PR AUC, and the accuracy, precision, recall, and F1 score at thresholds from 0.0 to 1.0 in steps of 0.1. The second column of result$probs is passed as the predicted probability of the target class "P".

evaluate_result <- evaluate_performance(result$probs[, 2], result$y_test, "P")
evaluate_result

## =========================================
##  quickSentiment Model Evaluation 
## =========================================
## Target Class:   P 
## 
## --- Global Metrics ---
## ROC AUC:        0.6901 
## PR AUC:         0.4404 
## 
## --- Optimal Thresholds ---
## Best ROC Threshold (Youden's J):  0.2792 
## Best PR Threshold (F1-Score):     0.2792 
## Accuracy at Best PR Threshold:    0.7257 
## 
## --- Threshold Summary Table ---
##  Threshold Accuracy Precision Recall    F1
##        0.0    0.257     0.257  1.000 0.409
##        0.1    0.473     0.307  0.836 0.449
##        0.2    0.633     0.377  0.656 0.479
##        0.3    0.726     0.469  0.492 0.480
##        0.4    0.738     0.480  0.197 0.279
##        0.5    0.755     0.636  0.115 0.194
##        0.6    0.751     0.667  0.066 0.119
##        0.7    0.747     1.000  0.016 0.032
##        0.8    0.743     0.000  0.000 0.000
##        0.9    0.743     0.000  0.000 0.000
##        1.0    0.743     0.000  0.000 0.000
## 
## (Note: Use plot() to view ROC and PR curves)
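The columns of the threshold table follow from a confusion matrix at each cut-off. As a worked check, the sketch below recomputes the 0.5 row from counts that are consistent with the reported figures (237 test rows, of which about 25.7%, i.e. 61, are "P"). The counts are inferred from the rounded metrics rather than printed by the package, and `confusion_metrics()` is a hypothetical helper, not part of quickSentiment.

```r
# Confusion-matrix metrics with "P" as the positive class.
# Precision, recall and F1 fall back to 0 when their denominator is 0,
# which matches the 0.000 entries in the table for thresholds >= 0.8.
confusion_metrics <- function(tp, fp, fn, tn) {
  precision <- if (tp + fp > 0) tp / (tp + fp) else 0
  recall    <- if (tp + fn > 0) tp / (tp + fn) else 0
  f1 <- if (precision + recall > 0) {
    2 * precision * recall / (precision + recall)
  } else {
    0
  }
  c(accuracy  = (tp + tn) / (tp + fp + fn + tn),
    precision = precision,
    recall    = recall,
    f1        = f1,
    youden_j  = recall + tn / (tn + fp) - 1)  # sensitivity + specificity - 1
}

# Counts inferred from the 0.5 row: 7 true positives, 4 false positives,
# 54 false negatives, 172 true negatives (7 + 4 + 54 + 172 = 237).
m <- confusion_metrics(tp = 7, fp = 4, fn = 54, tn = 172)
print(round(m, 3))
```

The rounded accuracy, precision, recall, and F1 reproduce the 0.5 row (0.755, 0.636, 0.115, 0.194). The low recall at 0.5 shows why the suggested thresholds sit much lower, near 0.28: with roughly three negatives for every positive, few tweets receive a "P" probability above one half.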

plot(evaluate_result$roc)

plot(evaluate_result$prc)

— 6. PREDICTION ON NEW, UNSEEN DATA —

The training is complete. The ‘result’ object now contains our trained model and all the necessary “artifacts” for prediction. For illustration, the call below scores the same cleaned tweets that were passed to pipeline(); genuinely new text should be run through pre_process() first.

predicted_tweets <- predict_sentiment(
  pipeline_object = result,
  tweets$cleaned_text
)

## --- Preparing new data for prediction ---

##   - Applying BoW transformation (term_frequency) to new data...

## --- Making Predictions ---

## --- Prediction Complete ---

head(predicted_tweets)

##   predicted_class    prob_N    prob_P
## 1               P 0.4664830 0.5335170
## 2               P 0.3152195 0.6847805
## 3               P 0.3464905 0.6535095
## 4               P 0.3411345 0.6588655
## 5               P 0.3740126 0.6259874
## 6               P 0.2542101 0.7457899
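The returned probabilities can also be re-thresholded by hand, for example to apply a cut-off other than the default. The sketch below copies the six rows printed above into a small data frame and labels them at two thresholds. `classify_at()` is a hypothetical helper, and the use of `>=` at the boundary is an assumption; whether predict_sentiment() itself accepts a threshold argument is not shown in this vignette.

```r
# Probabilities copied from the head() output above.
preds <- data.frame(
  prob_N = c(0.4664830, 0.3152195, 0.3464905, 0.3411345, 0.3740126, 0.2542101),
  prob_P = c(0.5335170, 0.6847805, 0.6535095, 0.6588655, 0.6259874, 0.7457899)
)

# Hypothetical helper: label a row "P" when prob_P reaches the threshold.
classify_at <- function(prob_p, threshold = 0.5) {
  ifelse(prob_p >= threshold, "P", "N")
}

preds$class_default <- classify_at(preds$prob_P)        # cut-off 0.5
preds$class_strict  <- classify_at(preds$prob_P, 0.65)  # illustrative stricter cut-off
print(preds)
```

At 0.5 all six rows are labelled "P", matching predicted_class above; at the illustrative 0.65 cut-off, rows 1 and 5 switch to "N".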
