What do Wikipedia’s readers care about? Is Britney Spears more popular than Brittany? Is Asia Carrera more popular than Asia? How many people looked at the article on Santa Claus in December? How many looked at the article on Ron Paul?
What can you find?
Source: http://stats.grok.se/
The wikipediatrend package provides convenient access to daily page view counts (Wikipedia article traffic statistics) stored at http://stats.grok.se/ .
If you want to know how often an article has been viewed over time and work with the data from within R, this package is for you. If you want to compare how much attention articles in different languages got and when, this package is for you. Are you up to policy studies or epidemiology? Have a look at page counts for Flu, Ebola, Climate Change or Millennium Development Goals and maybe build a model or two. Again, this package is for you.
If you simply want to browse Wikipedia page view statistics without all that coding, visit http://stats.grok.se/ and have a look around.
If non-big data is not an option, get the raw data in their entirety at http://dumps.wikimedia.org/other/pagecounts-raw/ .
If you think days are too crude a measure of time, seconds would suit you better, and you do not need to know which articles the views belong to, go to http://datahub.io/dataset/english-wikipedia-pageviews-by-second.
To get further information on the data source (Who? When? How? How good?) there is a Wikipedia article for that: http://en.wikipedia.org/wiki/Wikipedia:Pageview_statistics and another one: http://en.wikipedia.org/wiki/Wikipedia:About_page_view_statistics .
stable CRAN version
install.packages("wikipediatrend")
development version
devtools::install_github("petermeissner/wikipediatrend")
… and load it via:
library(wikipediatrend)
The workhorse of the package is the wp_trend() function that allows you to get page view counts as neat data frames like this:
page_views <- wp_trend("main_page")
page_views
## date count lang page rank month title
## 1 2015-04-30 18102799 en Main_page 2 201504 Main_page
## 2 2015-04-08 18297719 en Main_page 2 201504 Main_page
## 3 2015-04-09 18048572 en Main_page 2 201504 Main_page
## 4 2015-04-01 14418119 en Main_page 2 201504 Main_page
## 5 2015-04-02 11297180 en Main_page 2 201504 Main_page
## 6 2015-04-03 13383207 en Main_page 2 201504 Main_page
## 7 2015-04-04 17081542 en Main_page 2 201504 Main_page
## 8 2015-04-05 16332148 en Main_page 2 201504 Main_page
## 9 2015-04-06 19546248 en Main_page 2 201504 Main_page
## 10 2015-04-07 18572607 en Main_page 2 201504 Main_page
## 11 2015-04-26 18405928 en Main_page 2 201504 Main_page
## 12 2015-04-27 19850863 en Main_page 2 201504 Main_page
## 13 2015-04-24 14112564 en Main_page 2 201504 Main_page
## 14 2015-04-25 18455569 en Main_page 2 201504 Main_page
## 15 2015-04-22 19296839 en Main_page 2 201504 Main_page
## 16 2015-04-23 17269962 en Main_page 2 201504 Main_page
## 17 2015-04-20 19519351 en Main_page 2 201504 Main_page
## 18 2015-04-21 19048473 en Main_page 2 201504 Main_page
## 19 2015-04-28 19779863 en Main_page 2 201504 Main_page
## 20 2015-04-29 19643945 en Main_page 2 201504 Main_page
## 21 2015-04-19 18013169 en Main_page 2 201504 Main_page
## 22 2015-04-18 17032359 en Main_page 2 201504 Main_page
## 23 2015-04-17 13083942 en Main_page 2 201504 Main_page
## 24 2015-04-16 16400116 en Main_page 2 201504 Main_page
## 25 2015-04-15 18357760 en Main_page 2 201504 Main_page
## 26 2015-04-14 18210477 en Main_page 2 201504 Main_page
## 27 2015-04-13 21138174 en Main_page 2 201504 Main_page
## 28 2015-04-12 17966427 en Main_page 2 201504 Main_page
## 29 2015-04-11 17118818 en Main_page 2 201504 Main_page
## 30 2015-04-10 14087993 en Main_page 2 201504 Main_page
… that can easily be turned into a plot …
library(ggplot2)
ggplot(page_views, aes(x=date, y=count)) +
geom_line(size=1.5, colour="steelblue") +
geom_smooth(method="loess", colour="#00000000", fill="#001090", alpha=0.1) +
scale_y_continuous( breaks=c(10e6, 15e6, 20e6),
label=c("10 M","15 M","20 M")) +
theme_bw()
wp_trend() options
wp_trend() has several options and most of them are set to defaults:
page, from = prev_month_start(), to = prev_month_end(), lang = "en", file = wp_cache_file(),
page
from = Sys.Date() - 30
to = Sys.Date()
lang = "en"
file = wp_cache_file()
friendly
requestFrom
userAgent
page
The page option lets you specify one or more article titles for which data should be retrieved.
These titles should be in the same format as shown in the address bar of your browser to ensure that the pages are found. If we want to get page views for the United Nations Millennium Development Goals and the article is found here: “http://en.wikipedia.org/wiki/Millennium_Development_Goals” the page title to pass to wp_trend() should be Millennium_Development_Goals, not Millennium Development Goals or Millennium_development_goals or any other ‘mostly-like-the-original’ variation.
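One way to avoid typing a 'mostly-like-the-original' title by hand is to cut it out of the article URL. The following is only a minimal sketch in base R; title_from_url() and lang_from_url() are hypothetical helpers written for this illustration and are not part of wikipediatrend.

```r
# Hypothetical helper (not part of wikipediatrend): return the page title
# exactly as it appears after "/wiki/" in the article URL.
title_from_url <- function(url) {
  sub("^.*/wiki/", "", url)
}

# Hypothetical helper: the language shorthand is assumed to be the first
# part of the host name, e.g. "en" in en.wikipedia.org.
lang_from_url <- function(url) {
  sub("^https?://([^./]+)\\..*$", "\\1", url)
}

url <- "http://en.wikipedia.org/wiki/Millennium_Development_Goals"
title_from_url(url)
lang_from_url(url)
```

Both values can then be passed on as the page and lang options.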
To ease data gathering, the page option of wp_trend() accepts whole vectors of page titles and will retrieve data for each one after another.
page_views <- wp_trend( page = c( "Millennium_Development_Goals",
"Climate_Change") )
library(ggplot2)
ggplot(page_views, aes(x=date, y=count, group=page, color=page)) +
geom_line(size=1.5) + theme_bw()
from and to
These two options determine the time frame for which data shall be retrieved. The defaults are set to gather the last 30 days but might be set to cover larger time frames as well. Note that there is no data prior to December 2007 so that any date prior will be set to this minimum.
page_views <- wp_trend(
page = "Millennium_Development_Goals" ,
from = "2000-01-01",
to = prev_month_end())
library(ggplot2)
ggplot(page_views, aes(x=date, y=count, color=wp_year(date))) +
geom_line() +
stat_smooth(method = "lm", formula = y ~ poly(x, 22), color="#CD0000a0", size=1.2) +
theme_bw()
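The rule that dates before the start of the data are moved up to the minimum can be pictured as follows. This is a sketch only: the text above says "December 2007" without naming a day, so the first of the month is an assumption, and clamp_from() is a hypothetical helper rather than a package function.

```r
# Assumed earliest day with data; the text only says "December 2007"
min_date <- as.Date("2007-12-01")

# Hypothetical helper: move any date before the minimum up to the minimum
clamp_from <- function(from, lower = min_date) {
  from <- as.Date(from)
  if (from < lower) lower else from
}

clamp_from("2000-01-01")  # earlier than the minimum, so it is moved up
clamp_from("2012-06-15")  # later than the minimum, so it is kept
```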
lang
This option determines for which Wikipedia the page views shall be retrieved: English, German, Chinese, Spanish, … . The default is set to "en" for the English Wikipedia. Either pass a single language shorthand, which is then used for all pages, or pass one language shorthand per page.
page_views <- wp_trend(
page = c("Objetivos_de_Desarrollo_del_Milenio",
"Millennium_Development_Goals") ,
lang = c("es", "en"),
from = Sys.Date()-100
)
library(ggplot2)
ggplot(page_views, aes(x=date, y=count, group=lang, color=lang, fill=lang)) +
geom_smooth(size=1.5) +
geom_point() +
theme_bw()
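The two ways of passing lang (one shorthand for all pages, or one per page) amount to ordinary recycling. The sketch below illustrates that idea in base R; pair_pages_langs() is a hypothetical helper and says nothing about how wp_trend() handles this internally.

```r
# Hypothetical helper: build one (page, lang) pair per request
pair_pages_langs <- function(page, lang = "en") {
  if (!(length(lang) %in% c(1, length(page)))) {
    stop("lang must have length 1 or the same length as page")
  }
  data.frame(
    page = page,
    lang = rep_len(lang, length(page)),  # a single lang is reused for all pages
    stringsAsFactors = FALSE
  )
}

# one shorthand for all pages
pair_pages_langs(c("Syria", "Everest"))

# one shorthand per page
pair_pages_langs(
  c("Objetivos_de_Desarrollo_del_Milenio", "Millennium_Development_Goals"),
  c("es", "en")
)
```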
file
This last option defines where the package should cache the data to prevent unnecessary downloads of already existing data. The default is to use and reuse a file in the temporary folder, but this can be replaced by any valid filename, e.g. file = "MyCache.csv".
To get the path and name of the default cache file use the wp_cache_file() function:
wp_cache_file()
## [1] "C:/Users/Peter/AppData/Local/Temp/wikipediatrend_cache.csv"
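To make the idea of a file cache concrete, here is a minimal sketch of the general technique: look into a CSV file first and only fetch what is missing. It is not the package's implementation; cached_fetch() and fake_fetch() are hypothetical, and the "download" is a stand-in that returns made-up counts.

```r
# Hypothetical helper: only call fetch_fun() for dates not yet in the file
cached_fetch <- function(dates, fetch_fun, file) {
  dates <- as.character(dates)
  cache <- if (file.exists(file)) {
    read.csv(file, stringsAsFactors = FALSE)
  } else {
    data.frame(date = character(0), count = numeric(0),
               stringsAsFactors = FALSE)
  }
  todo <- setdiff(dates, cache$date)
  if (length(todo) > 0) {
    new_rows <- data.frame(date = todo, count = fetch_fun(todo),
                           stringsAsFactors = FALSE)
    cache <- rbind(cache, new_rows)
    write.csv(cache, file, row.names = FALSE)
  }
  # return only what was asked for, not the whole cache
  cache[cache$date %in% dates, ]
}

# Stand-in for a download that counts how often it is called
n_calls <- 0
fake_fetch <- function(d) {
  n_calls <<- n_calls + 1
  rep(100, length(d))
}

cache_file <- tempfile(fileext = ".csv")
first  <- cached_fetch(c("2015-04-01", "2015-04-02"), fake_fetch, cache_file)
second <- cached_fetch(c("2015-04-01", "2015-04-02"), fake_fetch, cache_file)
n_calls  # the second request is answered from the file
```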
While wp_trend() will never return more data than specified by the options page, lang, from, and to, wp_get_cache() can be used to retrieve all the data cached so far:
cache <- wp_get_cache()
head(cache)
## date count lang page rank month title
## 1 2015-04-30 425 en Everest -1 201504 Everest
## 2 2015-04-08 305 en Everest -1 201504 Everest
## 3 2015-04-09 318 en Everest -1 201504 Everest
## 4 2015-04-01 315 en Everest -1 201504 Everest
## 5 2015-04-02 312 en Everest -1 201504 Everest
## 6 2015-04-03 275 en Everest -1 201504 Everest
dim(cache)
## [1] 5498 7
Last but not least the cache (file), that might exist across multiple sessions, can also be reset:
wp_cache_reset()
… or set to another file. This setting stays in effect for the rest of the R session or until it is set to something else again:
# save path of current cache file
tmp <- wp_cache_file()
# set cache file
wp_set_cache_file("My_Other_Cache_File.csv")
wp_cache_file()
## [1] "My_Other_Cache_File.csv"
wp_get_cache()
## data frame with 0 columns and 0 rows
# set cache file back
wp_set_cache_file(tmp)
wp_cache_file()
## [1] "C:/Users/Peter/AppData/Local/Temp/wikipediatrend_cache.csv"
wp_get_cache()
## date count lang page rank month
## 3297 2015-01-08 48 ar %D8%AF%D8%A7 ... -1 201501
## 3313 2015-01-29 61 ar %D8%AF%D8%A7 ... -1 201501
## 3640 2015-03-03 4457 de Islamischer_ ... -1 201503
## 3624 2015-03-11 6661 de Islamischer_ ... -1 201503
## 5471 2014-11-23 258 de K%C3%A4se 6057 201411
## 3025 2015-01-07 24185 en Islamic_Stat ... -1 201501
## 3023 2015-01-09 30053 en Islamic_Stat ... -1 201501
## 3065 2015-02-11 51137 en Islamic_Stat ... -1 201502
## 3078 2015-03-11 39016 en Islamic_Stat ... -1 201503
## 384 2008-09-30 675 en Millennium_D ... 7435 200809
## 636 2009-05-10 496 en Millennium_D ... 7435 200905
## 870 2010-01-27 888 en Millennium_D ... 7435 201001
## 929 2010-03-04 1080 en Millennium_D ... 7435 201003
## 1129 2010-10-29 1631 en Millennium_D ... 7435 201010
## 1294 2011-03-24 1908 en Millennium_D ... 7435 201103
## 1629 2012-02-10 1494 en Millennium_D ... 7435 201202
## 2288 2013-12-20 1177 en Millennium_D ... 7435 201312
## 2369 2014-02-17 2456 en Millennium_D ... 7435 201402
## 2475 2014-06-09 1803 en Millennium_D ... 7435 201406
## 4266 2012-02-15 15046 en Syria 1802 201202
## 4730 2013-05-04 9450 en Syria 1802 201305
## 4992 2014-02-01 4179 en Syria 1802 201402
## 5227 2014-10-01 5915 en Syria 1802 201410
## 3755 2014-10-26 156 es Estado_Isl%C ... -1 201410
## 3820 2014-12-11 119 es Estado_Isl%C ... -1 201412
## 3937 2015-04-21 3025 es Estado_Isl%C ... -1 201504
## 4046 2014-11-25 5649 ru %D0%98%D1%81 ... -1 201411
## 4140 2015-02-25 3954 ru %D0%98%D1%81 ... -1 201502
## 4187 2015-03-02 4954 ru %D0%98%D1%81 ... -1 201503
## title
## 3297 <U+062F><U+0627><U+0639><U+0634>
## 3313 <U+062F><U+0627><U+0639><U+0634>
## 3640 Islamischer_ ...
## 3624 Islamischer_ ...
## 5471 Käse
## 3025 Islamic_Stat ...
## 3023 Islamic_Stat ...
## 3065 Islamic_Stat ...
## 3078 Islamic_Stat ...
## 384 Millennium_D ...
## 636 Millennium_D ...
## 870 Millennium_D ...
## 929 Millennium_D ...
## 1129 Millennium_D ...
## 1294 Millennium_D ...
## 1629 Millennium_D ...
## 2288 Millennium_D ...
## 2369 Millennium_D ...
## 2475 Millennium_D ...
## 4266 Syria
## 4730 Syria
## 4992 Syria
## 5227 Syria
## 3755 Estado_Islám ...
## 3820 Estado_Islám ...
## 3937 Estado_Islám ...
## 4046 <U+0418><U+0441><U+043B><U+0430><U+043C><U+0441><U+043A><U+043E><U+0435>_<U+0433><U+043E> ...
## 4140 <U+0418><U+0441><U+043B><U+0430><U+043C><U+0441><U+043A><U+043E><U+0435>_<U+0433><U+043E> ...
## 4187 <U+0418><U+0441><U+043B><U+0430><U+043C><U+0441><U+043A><U+043E><U+0435>_<U+0433><U+043E> ...
##
## ... 5469 rows of data not shown
If comparing languages is important, one needs to specify the exact article title for each language: while the article about the Millennium Goals has an English title in the English Wikipedia, it is of course named differently in Spanish, German, Chinese, … . One might look these titles up by hand or use the handy wp_linked_pages() function like this:
titles <- wp_linked_pages("Islamic_State_of_Iraq_and_the_Levant", "en")
titles <- titles[titles$lang %in% c("en", "de", "es", "ar", "ru"),]
titles
## page lang title
## 1 Islamic_Stat ... en Islamic_Stat ...
## 2 %D8%AF%D8%A7 ... ar <U+062F><U+0627><U+0639><U+0634>
## 3 Islamischer_ ... de Islamischer_ ...
## 4 Estado_Isl%C ... es Estado_Islám ...
## 5 %D0%98%D1%81 ... ru <U+0418><U+0441><U+043B><U+0430><U+043C><U+0441><U+043A><U+043E><U+0435>_<U+0433><U+043E> ...
… then we can use the information to get data for several languages …
page_views <- wp_trend(page = titles$page[1:5],
lang = titles$lang[1:5],
from = "2014-08-01")
library(ggplot2)
for(i in unique(page_views$lang) ){
iffer <- page_views$lang==i
page_views[iffer, ]$count <- scale(page_views[iffer, ]$count)
}
ggplot(page_views, aes(x=date, y=count, group=lang, color=lang)) +
geom_line(size=1.2, alpha=0.5) +
ylab("standardized count\n(by lang: m=0, var=1)") +
theme_bw() +
scale_colour_brewer(palette="Set1") +
guides(colour = guide_legend(override.aes = list(alpha = 1)))
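The per-language standardization done in the loop above can also be written without a loop. The following sketch uses ave() from base R on made-up toy data; it is an alternative formulation for illustration, not a change to the analysis.

```r
# Toy data, made up for illustration only
toy <- data.frame(
  lang  = rep(c("en", "de"), each = 4),
  count = c(100, 120, 80, 100, 10, 14, 6, 10)
)

# ave() applies the function within each group and returns a plain numeric
# vector in the original row order
toy$count_std <- ave(toy$count, toy$lang,
                     FUN = function(x) (x - mean(x)) / sd(x))

# Each language now has mean 0 and variance 1
tapply(toy$count_std, toy$lang, mean)
tapply(toy$count_std, toy$lang, var)
```

Unlike scale(), which returns a one-column matrix, this keeps the column an ordinary numeric vector.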
AnomalyDetection
Currently the AnomalyDetection package is not available on CRAN, so we have to get it from GitHub, either with install_github() from the devtools package or, as done here, from a drat repository hosted there.
install.packages(
"AnomalyDetection",
repos="http://ghrr.github.io/drat",
type="source"
)
library(AnomalyDetection)
library(dplyr)
library(ggplot2)
The package is a little picky about the data it accepts for processing so we have to build a new data frame. It should contain only the date and count variable. Furthermore, date should be named timestamp and transformed to type POSIXct.
page_views <- wp_trend("Syria", from = "2012-01-01")
## .
page_views_br <-
page_views %>%
select(date, count) %>%
rename(timestamp=date) %>%
unclass() %>%
as.data.frame() %>%
mutate(timestamp = as.POSIXct(timestamp))
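For readers who prefer to see the same reshaping without dplyr, here is a base R sketch. The data frame toy_views is a made-up stand-in for the result of wp_trend(), so the example runs without a download.

```r
# Toy stand-in for the result of wp_trend(); values are made up
toy_views <- data.frame(
  date  = as.Date(c("2012-01-01", "2012-01-02", "2012-01-03")),
  count = c(500, 520, 480),
  lang  = "en",
  stringsAsFactors = FALSE
)

# Keep only the two required columns, rename date to timestamp and
# convert it to POSIXct
toy_br <- data.frame(
  timestamp = as.POSIXct(toy_views$date),
  count     = toy_views$count
)

str(toy_br)
```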
Having transformed the data we can detect anomalies via AnomalyDetectionTs(). The function offers various options, e.g. the significance level for rejecting normal values (alpha); the maximum fraction of the data that is allowed to be detected as anomalies (max_anoms); whether upward deviations, downward deviations or irregularities in both directions might form the basis of anomaly detection (direction); and last but not least whether or not the time frame for detection is larger than one month (longterm).
Let's choose a greedy set of parameters and detect possible anomalies:
res <-
AnomalyDetectionTs(
x = page_views_br,
alpha = 0.05,
max_anoms = 0.40,
direction = "both",
longterm = T
)$anoms
res$timestamp <- as.Date(res$timestamp)
head(res)
## timestamp anoms
## 1 2012-01-01 4816
## 2 2012-01-07 5533
## 3 2012-01-13 6725
## 4 2012-01-12 8336
## 5 2012-01-11 9301
## 6 2012-01-10 8753
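To get a feeling for what options like direction and max_anoms do, here is a deliberately simple stand-in: a robust z-score rule based on median and MAD. It is not the algorithm used by AnomalyDetectionTs(); flag_anomalies() and its threshold argument are hypothetical, and the numbers are made up.

```r
# Hypothetical helper: flag values far from the median, measured in MADs.
# This is a simplified stand-in, not the method used by AnomalyDetection.
flag_anomalies <- function(x, threshold = 3, max_anoms = 0.4,
                           direction = c("both", "pos", "neg")) {
  direction <- match.arg(direction)
  z <- (x - median(x)) / mad(x)
  # which deviations count: both directions, only upward or only downward
  score <- switch(direction, both = abs(z), pos = z, neg = -z)
  candidates <- which(score > threshold)
  # never flag more than the given fraction of the data, most extreme first
  limit <- floor(max_anoms * length(x))
  candidates <- candidates[order(score[candidates], decreasing = TRUE)]
  sort(head(candidates, limit))
}

# Made-up series with one upward and one downward outlier
x <- c(10, 11, 9, 10, 12, 50, 10, 11, 1, 10)
flag_anomalies(x)                     # positions of both outliers
flag_anomalies(x, direction = "pos")  # only the upward one
flag_anomalies(x, direction = "neg")  # only the downward one
```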
… and play back the detected anomalies to our page_views data set:
page_views <-
page_views %>%
mutate(normal = !(page_views$date %in% res$timestamp)) %>%
mutate(anom = page_views$date %in% res$timestamp )
class(page_views) <- c("wp_df", "data.frame")
Now we can plot counts and anomalies …
(
p <-
ggplot( data=page_views, aes(x=date, y=count) ) +
geom_line(color="steelblue") +
geom_point(data=filter(page_views, anom==T), color="red2", size=2) +
theme_bw()
)
… as well as compare running means:
p +
geom_line(stat = "smooth", size=2, color="red2", alpha=0.7) +
geom_line(data=filter(page_views, anom==F),
stat = "smooth", size=2, color="dodgerblue4", alpha=0.5)
It seems like upward and downward anomalies cancel each other out most of the time, since both smooth lines (with and without anomalies) do not differ much. Nonetheless, keeping anomalies in will bias the counts slightly upward, so we proceed with a cleaned-up data set:
page_views_clean <-
page_views %>%
filter(anom==F) %>%
select(date, count, lang, page, rank, month, title)
page_views_br_clean <-
page_views_br %>%
filter(page_views$anom==F)
BreakoutDetection
BreakoutDetection is a package that searches data for mean level shifts by dividing it into timespans of change and those of stability in the presence of seasonal noise. Like AnomalyDetection, the BreakoutDetection package is not available on CRAN and has to be obtained from GitHub.
install.packages(
"BreakoutDetection",
repos="http://ghrr.github.io/drat",
type="source"
)
library(BreakoutDetection)
library(dplyr)
library(ggplot2)
library(magrittr)
… again the workhorse function (breakout()) is picky and requires “a data.frame which has ‘timestamp’ and ‘count’ components” like our page_views_br_clean.
The function has two general options: one tweaks the minimum length of a timespan (min.size); the other determines how many mean level changes might occur during the whole time frame (method). In addition there are several method-specific options, e.g. degree, beta, and percent, which control the sensitivity to adding further breakpoints. In the following case the last option tells the function that overall model fit should increase by at least 5 percent for a breakpoint to be added.
br <-
breakout(
page_views_br_clean,
min.size = 30,
method = 'multi',
percent = 0.05,
plot = TRUE
)
br
## $loc
## [1] 30 60 138 170 502 532
##
## $time
## [1] 0.7
##
## $pval
## [1] NA
##
## $plot
In the following snippet we combine the break information with our page views data and can have a look at the dates at which the breaks occurred.
breaks <- page_views_clean[br$loc,]
breaks
## date count lang page rank month title
## 30 2012-02-14 17177 en Syria 1802 201202 Syria
## 60 2012-03-07 13366 en Syria 1802 201203 Syria
## 138 2012-06-30 8315 en Syria 1802 201206 Syria
## 170 2012-07-27 17383 en Syria 1802 201207 Syria
## 502 2013-08-02 5171 en Syria 1802 201308 Syria
## 532 2013-10-06 5997 en Syria 1802 201310 Syria
Next, we add a span variable capturing which page_view observations belong to which span, allowing us to aggregate data.
page_views_clean$span <- 0
for (d in breaks$date ) {
page_views_clean$span[ page_views_clean$date > d ] %<>% add(1)
}
page_views_clean$mcount <- 0
for (s in unique(page_views_clean$span) ) {
iffer <- page_views_clean$span == s
page_views_clean$mcount[ iffer ] <- mean(page_views_clean$count[iffer])
}
spans <-
page_views_clean %>%
as_data_frame() %>%
group_by(span) %>%
summarize(
start = min(date),
end = max(date),
length = end-start,
mean_count = round(mean(count)),
min_count = min(count),
max_count = max(count),
var_count = var(count)
)
spans
## Source: local data frame [7 x 8]
##
## span start end length mean_count min_count max_count
## 1 0 2012-01-02 2012-02-14 43 11640 5280 36378
## 2 1 2012-02-15 2012-03-07 21 14888 12324 23197
## 3 2 2012-03-08 2012-06-30 114 8834 4222 23467
## 4 3 2012-07-03 2012-07-27 24 13633 6464 31744
## 5 4 2012-07-28 2013-08-02 370 7681 0 20874
## 6 5 2013-08-03 2013-10-06 64 23434 4121 96840
## 7 6 2013-10-08 2015-04-30 569 4916 0 19187
## Variables not shown: var_count (dbl)
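The span and mcount variables built with the two loops above can also be computed in a vectorized way. The sketch below shows one possible base R formulation on made-up toy data; it follows the same "date > break date" rule as the loop.

```r
# Toy data, made up for illustration only
toy <- data.frame(
  date  = as.Date("2012-01-01") + 0:9,
  count = c(5, 7, 6, 20, 22, 21, 19, 8, 9, 7)
)
break_dates <- as.Date(c("2012-01-03", "2012-01-07"))

# Number of break dates strictly before each observation, i.e. the same
# "date > d" rule as in the loop
toy$span <- findInterval(as.numeric(toy$date), as.numeric(break_dates),
                         left.open = TRUE)

# Mean count within each span, repeated for every row of that span
toy$mcount <- ave(toy$count, toy$span)

toy
```

Note that findInterval() expects the break dates to be sorted in increasing order.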
Also, we can now plot the shifting mean.
ggplot(page_views_clean, aes(x=date, y=count) ) +
geom_line(alpha=0.5, color="steelblue") +
geom_line(aes(y=mcount), alpha=0.5, color="red2", size=1.2) +
theme_bw()