This vignette provides an introduction to the R package academictwitteR. The package is useful solely for querying the Twitter Academic Research Product Track v2 API endpoints.
This version of the Twitter API allows researchers to access larger volumes of Twitter data. For more information on the Twitter API, including how to apply for access to the Academic Research Product Track, see the Twitter Developer platform.
The following vignette will guide you through how to use the package.
We will begin by describing the thinking behind the development of this package and, specifically, the data storage conventions we have established when querying the API.
The Academic Research Product Track permits the user to access larger volumes of data, over a far longer time range, than was previously possible. From the Twitter release for the new track:
“The Academic Research product track includes full-archive search, as well as increased access and other v2 endpoints and functionality designed to get more precise and complete data for analyzing the public conversation, at no cost for qualifying researchers. Since the Academic Research track includes specialized, greater levels of access, it is reserved solely for non-commercial use”.
The new “v2 endpoints” refer to the v2 API, introduced around the same time as the new Academic Research Product Track. Full details of the v2 endpoints are available on the Twitter Developer platform.
In summary, the Academic Research product track gives the authorized user full-archive search as well as increased levels of access to the v2 endpoints, at no cost for qualifying researchers.
Please refer to this vignette on how to obtain your own bearer token. You can supply the bearer token with every request, but the more advisable and secure approach is to store it in your .Renviron file.
We begin by loading the package with:
library(academictwitteR)
We then launch set_bearer(). This will open the .Renviron file in your home directory. Enter your bearer token as below (the bearer token shown below is not real).
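A minimal sketch of the resulting .Renviron entry, assuming the package reads the bearer token from the TWITTER_BEARER environment variable (the token below is a placeholder):

TWITTER_BEARER=AAAAAAAAAAAAAAAAAAAAAPwXWFFlLLDVC6G0PFo4shkDVg02DwVxGQIVKvhPVE3vdV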
For this environment variable to be recognized, you first have to restart your R session.
You can then retrieve your bearer token with get_bearer(). This is also the default source of the bearer token for all data collection functions.
You can check that this works with:
get_bearer()
#> [1] "AAAAAAAAAAAAAAAAAAAAAPwXWFFlLLDVC6G0PFo4shkDVg02DwVxGQIVKvhPVE3vdV"
Querying the Twitter API with academictwitteR
The workhorse function of academictwitteR for collecting tweets is get_all_tweets().
tweets <- get_all_tweets(
  query = "#BlackLivesMatter",
  start_tweets = "2020-01-01T00:00:00Z",
  end_tweets = "2020-01-05T00:00:00Z",
  file = "blmtweets"
)
Here, we are collecting tweets containing a hashtag related to the Black Lives Matter movement over the period January 1, 2020 to January 5, 2020.
Note that once we have stored our bearer token with set_bearer(), it is retrieved by the function automatically.
This query will only capture a maximum of 100 tweets, as we have not changed the default value of the n argument (100).
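To collect more, you can raise the cap explicitly via the n argument; a minimal sketch of the same query with an illustrative limit of 10,000 tweets:

tweets <- get_all_tweets(
  query = "#BlackLivesMatter",
  start_tweets = "2020-01-01T00:00:00Z",
  end_tweets = "2020-01-05T00:00:00Z",
  file = "blmtweets",
  n = 10000  # illustrative cap; the default is 100
)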
If you have not set your bearer token, you can also do so within the function call as follows:
tweets <- get_all_tweets(
  query = "#BlackLivesMatter",
  start_tweets = "2020-01-01T00:00:00Z",
  end_tweets = "2020-01-05T00:00:00Z",
  bearer_token = "AAAAAAAAAAAAAAAAAAAAAPwXWFFlLLDVC6G0Pg02DwVxGQIVKTHISISNOTAREALTOKEN",
  file = "blmtweets"
)
This is not recommended, however: it is bad practice to keep API authorization tokens in your scripts.
Data storage conventions in academictwitteR
Given the sizeable increase in the volume of data potentially retrievable with the Academic Research Product Track, it is advisable that researchers establish clear storage conventions to mitigate data loss caused by, for example, the unplanned interruption of an API query.
We first draw your attention to the file argument in the code for the API query above. With file, the user can specify the name of a file, stored with a “.rds” extension, which includes all of the tweet-level information collected for a given query.
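Since the output is a standard “.rds” file, it can be read back into R in a later session. A minimal sketch, assuming the file = "blmtweets" call above produced blmtweets.rds in the working directory:

tweets <- readRDS("blmtweets.rds")  # reload the stored tweet-level data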
Alternatively, the user can specify a data_path as follows:
tweets <- get_all_tweets(
  query = "#BlackLivesMatter",
  start_tweets = "2015-01-01T00:00:00Z",
  end_tweets = "2020-01-05T00:00:00Z",
  data_path = "data/",
  bind_tweets = FALSE,
  n = 1000000
)
In the data path, the user can either specify a directory that already exists or name a new directory.
The data are stored in this folder as a series of JSON files: tweet-level data in files beginning “data_”, and user-level data in files beginning “users_”.
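A quick way to inspect what has been written, assuming the data_path = "data/" used above:

# tweet-level files begin "data_"; user-level files begin "users_"
list.files("data/", pattern = "^data_")
list.files("data/", pattern = "^users_")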
Note that the get_all_tweets() function always returns a data.frame object unless data_path is specified and bind_tweets is set to FALSE.
When collecting large amounts of data, we recommend using the data_path option with bind_tweets = FALSE. This mitigates potential data loss in case the query is interrupted, and avoids system memory usage errors.
Note, finally, that here we set an upper limit of one million tweets. The default limit is 100; for almost all applications, users will wish to change this. We can also set n = Inf if we do not require any upper limit, which will collect all available tweets matching the query.
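For instance, a sketch of the query above with no upper limit (note that an uncapped query can retrieve very large volumes of data):

tweets <- get_all_tweets(
  query = "#BlackLivesMatter",
  start_tweets = "2015-01-01T00:00:00Z",
  end_tweets = "2020-01-05T00:00:00Z",
  data_path = "data/",
  bind_tweets = FALSE,
  n = Inf  # no upper limit: collect all available matching tweets
)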
When bind_tweets is FALSE, no data.frame object is returned. To get the tweets into a data.frame, you can then use the bind_tweets() helper function to bundle the JSONs into a data.frame object for analysis in R, as such:
tweets <- bind_tweets(data_path = "data/")
If you want to bundle together the user-level data, you can achieve this with the same helper function. The only change is that user is now set to TRUE, meaning we want to bundle user-level data:
users <- bind_tweets(data_path = "data/", user = TRUE)
Note: v0.2 of the package incorporates functionality to convert JSONs into multiple data frame formats. Most usefully, these additions permit the incorporation of user-level and tweet-level data into a single tibble.
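As a sketch of this newer functionality, assuming the output_format argument added in v0.2 (with "tidy" as one of its accepted values):

# bind tweet-level and user-level JSONs into a single tidy tibble
tweets_tidy <- bind_tweets(data_path = "data/", output_format = "tidy")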