ANLP is an R package that provides all the functionality you need to build a text prediction model.
Functionalities supported by the ANLP package:
- readTextFile: reads text data from a file in the specified encoding
- sampleTextData: samples a fraction of the text data
- cleanTextData: cleans the text corpus
- generateTDM: builds N-gram term-frequency models
- predict_Backoff: predicts the next word using a back-off strategy
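For example, once the package is loaded, you could read your own tweets from a local file like this (the file name below is hypothetical, and the path-then-encoding argument order is assumed):

# Hypothetical file; read it as text using the given encoding
raw.tweets <- readTextFile("en_US.twitter.txt", "UTF-8")

In this walkthrough, however, we will use the twitter.data sample that ships with the package: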
library(ANLP)
print(length(twitter.data))
## [1] 109091
There are more than 100k tweets in the dataset. To start, we will sample 10% of them (roughly 10k tweets) to build our model, using the sampleTextData function as follows:
train.data <- sampleTextData(twitter.data, 0.1)
print(length(train.data))
## [1] 10839
head(train.data)
## [1] "Desk put together, room all set up. Oh boy, oh boy"
## [2] "ya ik and i never asked him to follow me i only mentioned him once in one of my tweets- i didnt do anything else"
## [3] "Small market baseball. You, know...for the 99%."
## [4] "nice I watched the whole series, LOVED Julia and her mom Erica was such a badass"
## [5] "I know, I know. Then you kick yourself when the fight goes lopsided. But if the upset DOES happen, wow. Nothing like it."
## [6] "love chris brown"
Now we have about 10k tweets, but the data is still quite noisy: it is full of punctuation, abbreviations, and contractions. The cleanTextData function cleans up the corpus for us:
train.data.cleaned <- cleanTextData(train.data)
train.data.cleaned[[1]]$content[1:5]
## [1] "desk put together room all set up oh boy oh boy"
## [2] "ya ik and i never asked him to follow me i only mentioned him once in one of my tweets i didnt do anything else"
## [3] "small market baseball you knowfor the "
## [4] "nice i watched the whole series loved julia and her mom erica was such a badass"
## [5] "i know i know then you kick yourself when the fight goes lopsided but if the upset does happen wow nothing like it"
As we can see, all the texts are now cleaned and look good :)
The next step is to build N-gram models from our cleaned corpus. We will build 1-, 2-, and 3-gram models, generating a term-frequency matrix for each:
unigramModel <- generateTDM(train.data.cleaned, 1)
head(unigramModel)
## word freq
## 13722 the 4256
## 15553 you 2761
## 497 and 1853
## 5129 for 1736
## 9345 not 1452
## 13712 that 1212
bigramModel <- generateTDM(train.data.cleaned, 2)
head(bigramModel)
## word freq
## 28339 i am 702
## 31422 it is 470
## 29809 in the 376
## 16407 do not 372
## 21435 for the 332
## 42993 of the 273
trigramModel <- generateTDM(train.data.cleaned, 3)
head(trigramModel)
## word freq
## 37375 i do not 131
## 77823 thanks for the 110
## 15410 can not wait 68
## 37222 i can not 65
## 37008 i am not 56
## 49144 looking forward to 53
Good work :) Now that we have all 3 models, let's predict.
The predict_Backoff function accepts a list of all the N-gram models, so let's merge them into a single list.
Note: Remember to merge the N-gram models in descending order (3-, 2-, then 1-gram).
nGramModelsList <- list(trigramModel, bigramModel, unigramModel)
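For intuition, here is a rough sketch of what a back-off lookup does. This is only an illustration, not ANLP's actual implementation; simple_backoff is a hypothetical helper that scans each model's word column:

# Toy back-off: try the trigram model with the last 2 words of the
# input, then the bigram model with the last word, then fall back
# to the top unigram. `models` must be ordered 3-, 2-, 1-gram.
simple_backoff <- function(text, models) {
  words <- strsplit(tolower(text), "\\s+")[[1]]
  n <- length(models)
  for (i in seq_along(models)) {
    ng <- n - i + 1                                # order of this N-gram model
    if (ng == 1) return(as.character(models[[i]]$word[1]))  # most frequent word
    context <- paste(tail(words, ng - 1), collapse = " ")
    # keep N-grams whose first N-1 words match the end of the input
    hits <- models[[i]][startsWith(as.character(models[[i]]$word),
                                   paste0(context, " ")), ]
    if (nrow(hits) > 0) {
      best <- as.character(hits$word[which.max(hits$freq)])
      return(tail(strsplit(best, " ")[[1]], 1))    # last word is the prediction
    }
  }
}

This is why the list order matters: the predictor tries the highest-order model first and only backs off to shorter contexts when it finds no match.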
Let's predict the next word for a few test strings:
testString <- "I am the one who"
predict_Backoff(testString,nGramModelsList)
## [1] "blew"
testString <- "what is my"
predict_Backoff(testString,nGramModelsList)
## [1] "favorite"
testString <- "the best movie"
predict_Backoff(testString,nGramModelsList)
## [1] "about"
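Since predict_Backoff returns a single word, you can also chain calls to grow a sentence word by word. Here extend_sentence is just an illustrative helper, not part of the package:

# Grow a sentence by repeatedly appending the predicted next word
extend_sentence <- function(seed, models, n.words = 3) {
  for (i in seq_len(n.words)) {
    seed <- paste(seed, predict_Backoff(seed, models))
  }
  seed
}
extend_sentence("thanks for", nGramModelsList)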
Enjoy, and feel free to send feedback to achalshah20@gmail.com.