What do Wikipedia’s readers care about? Is Britney Spears more popular than Brittany? Is Asia Carrera more popular than Asia? How many people looked at the article on Santa Claus in December? How many looked at the article on Ron Paul?
[Figure: What can you find? Source: http://stats.grok.se/]

The wikipediatrend package provides convenient access to daily page view counts (Wikipedia article traffic statistics) stored at http://stats.grok.se/.

If you want to know how often an article has been viewed over time and want to work with that data from within R, this package is for you. If you want to compare how much attention articles in different languages received, and when, this package is for you. Are you into policy studies or epidemiology? Have a look at the page counts for Flu, Ebola, Climate Change or the Millennium Development Goals and maybe build a model or two. Again, this package is for you.

If you simply want to browse Wikipedia page view statistics without all that coding, visit http://stats.grok.se/ and have a look around.

If only big data will do, get the raw data in its entirety at http://dumps.wikimedia.org/other/pagecounts-raw/.

If you consider days too crude a measure of time, are content with per-second resolution, and do not need to know which articles the views belong to, go to http://datahub.io/dataset/english-wikipedia-pageviews-by-second.

For further information on the data source (Who? When? How? How good?) there are two Wikipedia articles on the topic: http://en.wikipedia.org/wiki/Wikipedia:Pageview_statistics and http://en.wikipedia.org/wiki/Wikipedia:About_page_view_statistics .

1 Installation

Install the stable CRAN version …

install.packages("wikipediatrend")

… or the development version …

devtools::install_github("petermeissner/wikipediatrend")

… and load it via:

library(wikipediatrend)

2 A first try

The workhorse of the package is the wp_trend() function that allows you to get page view counts as neat data frames like this:

page_views <- wp_trend("main_page")
page_views
##    date       count    lang page      rank month  title    
## 1  2015-04-30 18102799 en   Main_page 2    201504 Main_page
## 2  2015-04-08 18297719 en   Main_page 2    201504 Main_page
## 3  2015-04-09 18048572 en   Main_page 2    201504 Main_page
## 4  2015-04-01 14418119 en   Main_page 2    201504 Main_page
## 5  2015-04-02 11297180 en   Main_page 2    201504 Main_page
## 6  2015-04-03 13383207 en   Main_page 2    201504 Main_page
## 7  2015-04-04 17081542 en   Main_page 2    201504 Main_page
## 8  2015-04-05 16332148 en   Main_page 2    201504 Main_page
## 9  2015-04-06 19546248 en   Main_page 2    201504 Main_page
## 10 2015-04-07 18572607 en   Main_page 2    201504 Main_page
## 11 2015-04-26 18405928 en   Main_page 2    201504 Main_page
## 12 2015-04-27 19850863 en   Main_page 2    201504 Main_page
## 13 2015-04-24 14112564 en   Main_page 2    201504 Main_page
## 14 2015-04-25 18455569 en   Main_page 2    201504 Main_page
## 15 2015-04-22 19296839 en   Main_page 2    201504 Main_page
## 16 2015-04-23 17269962 en   Main_page 2    201504 Main_page
## 17 2015-04-20 19519351 en   Main_page 2    201504 Main_page
## 18 2015-04-21 19048473 en   Main_page 2    201504 Main_page
## 19 2015-04-28 19779863 en   Main_page 2    201504 Main_page
## 20 2015-04-29 19643945 en   Main_page 2    201504 Main_page
## 21 2015-04-19 18013169 en   Main_page 2    201504 Main_page
## 22 2015-04-18 17032359 en   Main_page 2    201504 Main_page
## 23 2015-04-17 13083942 en   Main_page 2    201504 Main_page
## 24 2015-04-16 16400116 en   Main_page 2    201504 Main_page
## 25 2015-04-15 18357760 en   Main_page 2    201504 Main_page
## 26 2015-04-14 18210477 en   Main_page 2    201504 Main_page
## 27 2015-04-13 21138174 en   Main_page 2    201504 Main_page
## 28 2015-04-12 17966427 en   Main_page 2    201504 Main_page
## 29 2015-04-11 17118818 en   Main_page 2    201504 Main_page
## 30 2015-04-10 14087993 en   Main_page 2    201504 Main_page

… that can easily be turned into a plot …

library(ggplot2)
ggplot(page_views, aes(x=date, y=count)) + 
  geom_line(size=1.5, colour="steelblue") + 
  geom_smooth(method="loess", colour="#00000000", fill="#001090", alpha=0.1) +
  scale_y_continuous( breaks=c(10e6, 15e6, 20e6), 
                      labels=c("10 M","15 M","20 M")) +
  theme_bw()

3 wp_trend() options

wp_trend() has several options, most of which come with defaults:

wp_trend(
  page,
  from = prev_month_start(),
  to   = prev_month_end(),
  lang = "en",
  file = wp_cache_file(),
  …
)

3.1 page

The page option allows you to specify one or more article titles for which data should be retrieved.

These titles should be given in exactly the same format as shown in the address bar of your browser to ensure that the pages are found. If we want page views for the United Nations Millennium Development Goals and the article is found at http://en.wikipedia.org/wiki/Millennium_Development_Goals , the title to pass to wp_trend() is Millennium_Development_Goals, not Millennium Development Goals or Millennium_development_goals or any other 'mostly-like-the-original' variation.
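
If your titles come as plain text, a small helper can at least take care of the spaces. The function below is only an illustrative sketch, not part of wikipediatrend, and it does not fix capitalization, which still has to match the article's URL:

# illustrative helper, not part of wikipediatrend:
# replace spaces with underscores so a title matches its URL form
as_wiki_title <- function(x) {
  gsub(" ", "_", x, fixed = TRUE)
}
as_wiki_title("Millennium Development Goals")
## [1] "Millennium_Development_Goals"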

To ease data gathering, wp_trend()'s page argument accepts whole vectors of page titles and will retrieve the data for each of them, one after another.

page_views <- wp_trend( page = c( "Millennium_Development_Goals",
                                  "Climate_Change") )
library(ggplot2)
ggplot(page_views, aes(x=date, y=count, group=page, color=page)) + 
  geom_line(size=1.5) + theme_bw()

3.2 from and to

These two options determine the time frame for which data is retrieved. The defaults cover the last 30 days but can be set to span much larger time frames. Note that there is no data prior to December 2007, so any earlier date will be raised to this minimum.

page_views <- wp_trend( 
                page = "Millennium_Development_Goals" ,
                from = "2000-01-01",
                to   = prev_month_end())
library(ggplot2)
ggplot(page_views, aes(x=date, y=count, color=wp_year(date))) + 
  geom_line() + 
  stat_smooth(method = "lm", formula = y ~ poly(x, 22), color="#CD0000a0", size=1.2) +
  theme_bw() 
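
Because we asked for data starting in 2000, a quick look at the returned dates, just a sanity check whose exact result depends on the data held by the server, confirms that nothing earlier than December 2007 comes back:

# earliest and latest dates actually returned
range(page_views$date)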

3.3 lang

This option determines which Wikipedia the page views are retrieved from: English, German, Chinese, Spanish, … . The default is "en" for the English Wikipedia. You can either pass a single language shorthand, which is then used for all pages, or supply one shorthand per page.
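
For the first case, one shorthand used for all pages, a minimal sketch might look like this (the German article titles "Berlin" and "Hamburg" are only illustrative):

page_views <- wp_trend( page = c("Berlin", "Hamburg"),
                        lang = "de" )

The example below uses the second form and pairs each page with its own language: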

page_views <- wp_trend( 
                page = c("Objetivos_de_Desarrollo_del_Milenio",
                         "Millennium_Development_Goals") ,
                lang = c("es", "en"),
                from = Sys.Date()-100
              )
library(ggplot2)
ggplot(page_views, aes(x=date, y=count, group=lang, color=lang, fill=lang)) + 
  geom_smooth(size=1.5) + 
  geom_point() +
  theme_bw() 

3.4 file

This last option defines where the package caches the data to prevent unnecessary downloads of data it already has. The default is to use and reuse a file in the temporary folder, but any valid file name can be used instead, e.g. file = "MyCache.csv".
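
Passing a project-specific cache file directly to wp_trend() might then look like this (a sketch; "MyCache.csv" is just an illustrative file name):

page_views <- wp_trend("Main_page", file = "MyCache.csv")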

To get the path and name of the default cache file use the wp_cache_file() function:

wp_cache_file()
## [1] "C:/Users/Peter/AppData/Local/Temp/wikipediatrend_cache.csv"

While wp_trend() will never return more data than specified by the options page, lang, from, and to, wp_get_cache() can be used to retrieve all the data cached so far:

cache <- wp_get_cache()
head(cache)
##   date       count lang page    rank month  title  
## 1 2015-04-30 425   en   Everest -1   201504 Everest
## 2 2015-04-08 305   en   Everest -1   201504 Everest
## 3 2015-04-09 318   en   Everest -1   201504 Everest
## 4 2015-04-01 315   en   Everest -1   201504 Everest
## 5 2015-04-02 312   en   Everest -1   201504 Everest
## 6 2015-04-03 275   en   Everest -1   201504 Everest
dim(cache)
## [1] 5498    7

Last but not least, the cache file, which might persist across multiple sessions, can also be reset:

wp_cache_reset()

… or pointed to another file. This setting lasts for the rest of the R session or until it is changed again:

# save path of current cache file
tmp <- wp_cache_file()

# set cache file 
wp_set_cache_file("My_Other_Cache_File.csv")

wp_cache_file()
## [1] "My_Other_Cache_File.csv"
wp_get_cache()
## data frame with 0 columns and 0 rows
# set cache file back
wp_set_cache_file(tmp)

wp_cache_file()
## [1] "C:/Users/Peter/AppData/Local/Temp/wikipediatrend_cache.csv"
wp_get_cache()
##      date       count    lang page             rank month 
## 3297 2015-01-08       48 ar   %D8%AF%D8%A7 ... -1   201501
## 3313 2015-01-29       61 ar   %D8%AF%D8%A7 ... -1   201501
## 3640 2015-03-03     4457 de   Islamischer_ ... -1   201503
## 3624 2015-03-11     6661 de   Islamischer_ ... -1   201503
## 5471 2014-11-23      258 de   K%C3%A4se        6057 201411
## 3025 2015-01-07    24185 en   Islamic_Stat ... -1   201501
## 3023 2015-01-09    30053 en   Islamic_Stat ... -1   201501
## 3065 2015-02-11    51137 en   Islamic_Stat ... -1   201502
## 3078 2015-03-11    39016 en   Islamic_Stat ... -1   201503
## 384  2008-09-30      675 en   Millennium_D ... 7435 200809
## 636  2009-05-10      496 en   Millennium_D ... 7435 200905
## 870  2010-01-27      888 en   Millennium_D ... 7435 201001
## 929  2010-03-04     1080 en   Millennium_D ... 7435 201003
## 1129 2010-10-29     1631 en   Millennium_D ... 7435 201010
## 1294 2011-03-24     1908 en   Millennium_D ... 7435 201103
## 1629 2012-02-10     1494 en   Millennium_D ... 7435 201202
## 2288 2013-12-20     1177 en   Millennium_D ... 7435 201312
## 2369 2014-02-17     2456 en   Millennium_D ... 7435 201402
## 2475 2014-06-09     1803 en   Millennium_D ... 7435 201406
## 4266 2012-02-15    15046 en   Syria            1802 201202
## 4730 2013-05-04     9450 en   Syria            1802 201305
## 4992 2014-02-01     4179 en   Syria            1802 201402
## 5227 2014-10-01     5915 en   Syria            1802 201410
## 3755 2014-10-26      156 es   Estado_Isl%C ... -1   201410
## 3820 2014-12-11      119 es   Estado_Isl%C ... -1   201412
## 3937 2015-04-21     3025 es   Estado_Isl%C ... -1   201504
## 4046 2014-11-25     5649 ru   %D0%98%D1%81 ... -1   201411
## 4140 2015-02-25     3954 ru   %D0%98%D1%81 ... -1   201502
## 4187 2015-03-02     4954 ru   %D0%98%D1%81 ... -1   201503
##      title           
## 3297 داعش
## 3313 داعش
## 3640 Islamischer_ ...
## 3624 Islamischer_ ...
## 5471 Käse            
## 3025 Islamic_Stat ...
## 3023 Islamic_Stat ...
## 3065 Islamic_Stat ...
## 3078 Islamic_Stat ...
## 384  Millennium_D ...
## 636  Millennium_D ...
## 870  Millennium_D ...
## 929  Millennium_D ...
## 1129 Millennium_D ...
## 1294 Millennium_D ...
## 1629 Millennium_D ...
## 2288 Millennium_D ...
## 2369 Millennium_D ...
## 2475 Millennium_D ...
## 4266 Syria           
## 4730 Syria           
## 4992 Syria           
## 5227 Syria           
## 3755 Estado_Islám ...
## 3820 Estado_Islám ...
## 3937 Estado_Islám ...
## 4046 Исламское_го ...
## 4140 Исламское_го ...
## 4187 Исламское_го ...
## 
## ... 5469 rows of data not shown

4 Counts for other languages

If comparing languages is important, one needs to specify the exact article title for each language: while the article about the Millennium Development Goals has an English title in the English Wikipedia, it is of course named differently in the Spanish, German, Chinese, … editions. One can look these titles up by hand or use the handy wp_linked_pages() function like this:

titles <- wp_linked_pages("Islamic_State_of_Iraq_and_the_Levant", "en")
titles <- titles[titles$lang %in% c("en", "de", "es", "ar", "ru"),]
titles 
##   page             lang title           
## 1 Islamic_Stat ... en   Islamic_Stat ...
## 2 %D8%AF%D8%A7 ... ar   داعش
## 3 Islamischer_ ... de   Islamischer_ ...
## 4 Estado_Isl%C ... es   Estado_Islám ...
## 5 %D0%98%D1%81 ... ru   Исламское_го ...

… then we can use the information to get data for several languages …

page_views <- wp_trend(page = titles$page[1:5], 
                       lang = titles$lang[1:5],
                       from = "2014-08-01")
library(ggplot2)

for(i in unique(page_views$lang) ){
  iffer <- page_views$lang==i
  page_views[iffer, ]$count <- scale(page_views[iffer, ]$count)
}

ggplot(page_views, aes(x=date, y=count, group=lang, color=lang)) + 
  geom_line(size=1.2, alpha=0.5) + 
  ylab("standardized count\n(by lang: m=0, var=1)") +
  theme_bw() + 
  scale_colour_brewer(palette="Set1") + 
  guides(colour = guide_legend(override.aes = list(alpha = 1)))

5 Going beyond Wikipediatrend – Anomalies and mean shifts

5.1 Identifying anomalies with AnomalyDetection

Currently the AnomalyDetection package is not available on CRAN, so we have to install it from another source, e.g. the drat repository used below or via devtools::install_github().

install.packages(
  "AnomalyDetection", 
  repos="http://ghrr.github.io/drat", 
  type="source"
)
library(AnomalyDetection)
library(dplyr)
library(ggplot2)

The package is a little picky about the data it accepts for processing, so we have to build a new data frame. It should contain only the date and count variables. Furthermore, date should be renamed to timestamp and converted to type POSIXct.

page_views <- wp_trend("Syria", from = "2012-01-01")
## .
page_views_br <- 
  page_views  %>% 
  select(date, count)  %>% 
  rename(timestamp=date)  %>% 
  unclass()  %>% 
  as.data.frame() %>% 
  mutate(timestamp = as.POSIXct(timestamp))

Having transformed the data, we can detect anomalies via AnomalyDetectionTs(). The function offers various options, e.g. the significance level for rejecting normal values (alpha); the maximum fraction of the data that may be flagged as anomalies (max_anoms); whether upward deviations, downward deviations, or irregularities in both directions form the basis of the anomaly detection (direction); and, last but not least, whether the time frame under consideration is longer than one month (longterm).

Let's choose a greedy set of parameters and detect possible anomalies:

res <- 
  AnomalyDetectionTs(
    x         = page_views_br, 
    alpha     = 0.05, 
    max_anoms = 0.40,
    direction = "both",
    longterm  = T
  )$anoms
res$timestamp <- as.Date(res$timestamp)

head(res)
##    timestamp anoms
## 1 2012-01-01  4816
## 2 2012-01-07  5533
## 3 2012-01-13  6725
## 4 2012-01-12  8336
## 5 2012-01-11  9301
## 6 2012-01-10  8753

… and play back the detected anomalies to our page_views data set:

page_views <- 
  page_views  %>% 
  mutate(normal = !(date %in% res$timestamp),
         anom   =   date %in% res$timestamp)
class(page_views) <- c("wp_df", "data.frame")

Now we can plot counts and anomalies …

(
p <-
  ggplot( data=page_views, aes(x=date, y=count) ) + 
    geom_line(color="steelblue") +
    geom_point(data=filter(page_views, anom==T), color="red2", size=2) +
    theme_bw()
)

… as well as compare running means:

p + 
  geom_line(stat = "smooth", size=2, color="red2", alpha=0.7) + 
  geom_line(data=filter(page_views, anom==F), 
            stat = "smooth", size=2, color="dodgerblue4", alpha=0.5) 

It seems that upward and downward anomalies mostly cancel each other out, since the two smooth lines (with and without anomalies) barely differ. Nonetheless, keeping the anomalies in would bias the counts slightly upward, so we proceed with a cleaned-up data set:

page_views_clean <- 
  page_views  %>% 
  filter(anom==F)  %>% 
  select(date, count, lang, page, rank, month, title)

page_views_br_clean <- 
  page_views_br  %>% 
  filter(page_views$anom==F)

5.2 Identifying mean shifts with BreakoutDetection

BreakoutDetection is a package that searches data for mean level shifts, dividing it into time spans of change and time spans of stability in the presence of seasonal noise. Like AnomalyDetection, the BreakoutDetection package is not available on CRAN and has to be installed from another source, e.g. the drat repository used below.

install.packages(
  "BreakoutDetection", 
  repos="http://ghrr.github.io/drat", 
  type="source"
)
library(BreakoutDetection)
library(dplyr)
library(ggplot2)
library(magrittr)

… again the workhorse function, breakout(), is picky and requires "a data.frame which has 'timestamp' and 'count' components", like our page_views_br_clean.
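
A quick look at the structure of that data frame (output omitted here) is an easy way to verify the expected format before calling breakout():

# verify that the data frame has a POSIXct 'timestamp' and a numeric 'count'
str(page_views_br_clean)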

The function has two general options: one sets the minimum length of a time span (min.size), the other determines how many mean level changes may occur over the whole time frame (method). In addition there are several method-specific options, e.g. degree, beta, and percent, which control how readily further breakpoints are added. In the case below the percent option tells the function that adding a breakpoint has to improve the overall model fit by at least 5 percent.

br <- 
  breakout(
    page_views_br_clean, 
    min.size = 30, 
    method   = 'multi', 
    percent  = 0.05,
    plot     = TRUE
  )
br
## $loc
## [1]  30  60 138 170 502 532
## 
## $time
## [1] 0.7
## 
## $pval
## [1] NA
## 
## $plot

In the following snippet we combine the breakpoint locations with our page views data and look at the dates at which the breaks occurred.

breaks <- page_views_clean[br$loc,]
breaks
##           date count lang  page rank  month title
## 30  2012-02-14 17177   en Syria 1802 201202 Syria
## 60  2012-03-07 13366   en Syria 1802 201203 Syria
## 138 2012-06-30  8315   en Syria 1802 201206 Syria
## 170 2012-07-27 17383   en Syria 1802 201207 Syria
## 502 2013-08-02  5171   en Syria 1802 201308 Syria
## 532 2013-10-06  5997   en Syria 1802 201310 Syria

Next, we add a span variable that captures which page_views observations belong to which span, allowing us to aggregate the data.

# assign each observation to the span it falls into:
# every break date lying before an observation pushes it one span further
page_views_clean$span <- 0
for (d in breaks$date ) {
  page_views_clean$span[ page_views_clean$date > d ] %<>% add(1)
}

# compute the mean count within each span
page_views_clean$mcount <- 0
for (s in unique(page_views_clean$span) ) {
  iffer <- page_views_clean$span == s
  page_views_clean$mcount[ iffer ] <- mean(page_views_clean$count[iffer])
}

spans <- 
  page_views_clean  %>% 
    as_data_frame() %>% 
    group_by(span) %>% 
    summarize(
      start      = min(date), 
      end        = max(date), 
      length     = end-start,
      mean_count = round(mean(count)),
      min_count  = min(count),
      max_count  = max(count),
      var_count  = var(count)
    )
spans
## Source: local data frame [7 x 8]
## 
##   span      start        end length mean_count min_count max_count
## 1    0 2012-01-02 2012-02-14     43      11640      5280     36378
## 2    1 2012-02-15 2012-03-07     21      14888     12324     23197
## 3    2 2012-03-08 2012-06-30    114       8834      4222     23467
## 4    3 2012-07-03 2012-07-27     24      13633      6464     31744
## 5    4 2012-07-28 2013-08-02    370       7681         0     20874
## 6    5 2013-08-03 2013-10-06     64      23434      4121     96840
## 7    6 2013-10-08 2015-04-30    569       4916         0     19187
## Variables not shown: var_count (dbl)

Also, we can now plot the shifting mean.

ggplot(page_views_clean, aes(x=date, y=count) ) + 
  geom_line(alpha=0.5, color="steelblue") + 
  geom_line(aes(y=mcount), alpha=0.5, color="red2", size=1.2) + 
  theme_bw()