The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
OpenRefine (formerly Google Refine) is a popular, open source data cleaning software1. rrefine enables users to programmatically trigger data transfer between R and OpenRefine. Using the functions available in this package, you can import, export, apply data cleaning operations, or delete a project in OpenRefine directly from R. There are several client libraries for automating OpenRefine tasks via Python, nodeJS and Ruby2. rrefine extends this functionality to R users.
rrefine is available on CRAN:
install.packages("rrefine")
The latest version of the package is also available on GitHub and can be installed via devtools by using the following:
# install.packages("devtools")
::install_github("vpnagraj/rrefine") devtools
library(rrefine)
lateformeeting
rrefine includes a sample “dirty” data set to illustrate its features. This object (lateformeeting
) is a simulated data frame that holds 63 observations of dates, days of the week, numbers of hours slept and indicators of whether or not the subject was on time for work. The data are recorded in inconsistent formats and will require cleaning in order to be parsed correctly by R. You can take a look at how messy things are below:
theDate | what.day.whas.it | sleephours | was.i.on.time.for.work |
---|---|---|---|
1/8/99 | Friday | 7 | Yes |
Janury 15 1999 | friday | 5 | No |
1/25/1999 | Monday | 8 | No |
1/28/1999 | thursday | 8 | Yes |
2/9/99 | Tuesday | six | Yes |
2/11/1999 | Thursday | 7 | No |
February 15, 1999 | Monday | 8 | Yes |
2/19/1999 | Friday | 10 | Yes |
3/3/1999 | Wednesday | 5 | n |
3/4/1999 | Thursday | 9 | Yes |
3/5/1999 | Fridday | 8 | Y |
March 8, 1999 | Monday | 11 | Yes |
3/15/99 | Monday | 8 | yep |
3/16/1999 | Tuesday | 8 | Yes |
3/17/1999 | Wedensday | 6.5 | N |
4/1/99 | Thursday | 7 | Yes |
4/12/1999 | Monday | 8 | Yes |
4/20/1999 | Tuesday | 10 | Yes |
4/27/99 | Tuesday | 7 | Yes |
5/5/1999 | Wednesday | 6 | N |
5/12/1999 | Wednseday | 9 | Yes |
5/13/99 | Thursday | 7 | Yes |
5/17/1999 | monday | 7 and a half | Y |
May 25 1999 | Tuesday | 8 | YES |
5/27/1999 | Thursday | 8 | No |
6/1/1999 | tuesday | 7 | Yes |
6/7/1999 | Monday | 10 | Yes |
6/11/1999 | Friday | 6 | NO |
6/15/99 | Tuesday | 8 | Yes |
6/24/1999 | thursday | 7.5 | Yes |
7/1/1999 | Thursday | 7 | No |
July 6 1999 | Wednesday | 5 | Yes |
7/12/1999 | Monday | 8 | Y |
7/20/1999 | tuesday | 9 | Yes |
7/21/99 | Wednesday | 6 | No |
7/23/1999 | fridya | 8 | Yes |
8/3/1999 | Tuesday | 7 | Yes |
8/5/1999 | Thursday | 6 | Y |
8/16/1999 | Modnay | 8 | Yes |
Aug 27 1999 | Friday | 4 | Yes |
August 31, 1999 | Teusday | 7 | yep |
September 1, 1999 | Wednesday | 11 | Yes |
9/10/1999 | Friday | 7 | Yes |
9/20/1999 | Monday | 7 | No |
9/23/1999 | thursdayy | 7.5 | Yes |
9/28/1999 | Monday | 8 | Yes |
10/1/99 | Friday | 7 | Y |
10/4/1999 | Monday | 6 | Yes |
10/5/1999 | tuesday | 9 | Yes |
10/14/1999 | Thursday | 8 | Yes |
10/22/1999 | Friday | 7 | No |
10/25/99 | Monday | 7 | Yes |
10/27/1999 | Wednesday | 8 | Y |
11/2/1999 | Tuesday | 7 | Yes |
11/3/1999 | Wednseday | 8 | Yes |
11/8/1999 | monday | 5 | Yes |
11/18/99 | Thursday | 8 | No |
12/1/1999 | Wednesday | 7 | Yes |
12/3/1999 | Friday | 6 | N |
12/6/1999 | Monday | 6.5 | Yes |
12/13/1999 | Monday | 9 | Yes |
12/16/1999 | Thursday | 7 | Yes |
12/21/99 | Tuesday | 8 | Y |
refine_upload()
While the data cleaning could be performed using R, the operations here describe a typical scenario for OpenRefine users. The first step to creating a new project is to make sure OpenRefine is installed and running3. By default, the application will run locally at http://127.0.0.1:3333/
. All of the functions in rrefine will assume the default local host name and port, however these can both be overridden4. Additionally, as of v1.1.0
the package will internally connect to the OpenRefine instance using a CSRF token in API requests5. The refine_upload()
function allows you to pass the contents of a delimited text file (csv or tsv) along with a project name (optional) and an argument to automatically open the browser in which OpenRefine is running. The example below demonstrates this workflow using the lateformeeting
sample data:
write.csv(lateformeeting, file = "lateformeeting.csv", row.names = FALSE)
refine_upload(file = "lateformeeting.csv", project.name = "lfm_cleanup", open.browser = TRUE)
With the project uploaded, you can perform any of the desired clean-up procedures in OpenRefine.
refine_operations()
Whether the data in OpenRefine has been uploaded via refine_upload()
or another method, users can programmatically apply operations to projects using refine_operations()
. This function will pass an arbitrary list of data cleaning operations to the specified project. Operations must be defined in valid JSON format6. In addition to the generic refine_operations()
that can flexibly accept any valid JSON operation, the rrefine package includes a series of wrapper functions to perform common data cleaning procedures:
refine_remove_column()
: Remove a column from a projectrefine_add_column()
: Add a column to a projectrefine_rename_column()
: Rename an existing column in a projectrefine_move_column()
: Move a column to a new indexrefine_transform()
: Apply arbitrary text transformationsrefine_to_lower()
: Coerce text to lowercaserefine_to_upper()
: Coerce text to uppercaserefine_to_title()
: Coerce text to title caserefine_to_null()
: Set values to NULL
refine_to_empty()
: Set text values to empty string (""
)refine_to_text()
: Coerce value to stringrefine_to_number()
: Coerce value to numericrefine_to_date()
: Coerce value to daterefine_trim_whitespace()
: Remove leading and trailing whitespacesrefine_collapse_whitespace()
: Collapse consecutive whitespaces to single whitespacerefine_unescape_html()
: Unescape HTML in stringThe example below demonstrates several operations using the lateformeeting
sample data:
refine_add_column(new_column = "dotw_allcaps",
base_column = "what.day.whas.it",
value = "grel:value",
project.name = "lfm_cleanup")
refine_to_upper(column_name = "dotw_allcaps", project.name = "lfm_cleanup")
refine_export(project.name = "lfm_cleanup")$dotw_allcaps
## [1] "FRIDAY" "FRIDAY" "MONDAY" "THURSDAY" "TUESDAY" "THURSDAY"
## [7] "MONDAY" "FRIDAY" "WEDNESDAY" "THURSDAY" "FRIDAY" "MONDAY"
## [13] "MONDAY" "TUESDAY" "WEDNESDAY" "THURSDAY" "MONDAY" "TUESDAY"
## [19] "TUESDAY" "WEDNESDAY" "WEDNESDAY" "THURSDAY" "MONDAY" "TUESDAY"
## [25] "THURSDAY" "TUESDAY" "MONDAY" "FRIDAY" "TUESDAY" "THURSDAY"
## [31] "THURSDAY" "WEDNESDAY" "MONDAY" "TUESDAY" "WEDNESDAY" "FRIDAY"
## [37] "TUESDAY" "THURSDAY" "MONDAY" "FRIDAY" "TUESDAY" "WEDNESDAY"
## [43] "FRIDAY" "MONDAY" "THURSDAY" "MONDAY" "FRIDAY" "MONDAY"
## [49] "TUESDAY" "THURSDAY" "FRIDAY" "MONDAY" "WEDNESDAY" "TUESDAY"
## [55] "WEDNESDAY" "MONDAY" "THURSDAY" "WEDNESDAY" "FRIDAY" "MONDAY"
## [61] "MONDAY" "THURSDAY" "TUESDAY"
refine_remove_column(column = "dotw_allcaps", project.name = "lfm_cleanup")
refine_export()
Once you’ve cleaned up the data in OpenRefine you can pull it back into R for plotting, modeling, etc. by using refine_export()
. This function will accept either the project name or the numerical unique identifier. It is only necessary to use both if there are multiple projects with the same name in your OpenRefine application. Note that the data is exported directly into R as a data frame and you can assign it to a new object.
<- refine_export(project.name = "lfm_cleanup")
lfm_clean lfm_clean
date | dotw | hours.slept | on.time |
---|---|---|---|
1999-01-08T00:00:00Z | Friday | 7.0 | Yes |
1999-01-15T00:00:00Z | Friday | 5.0 | No |
1999-01-25T00:00:00Z | Monday | 8.0 | No |
1999-01-28T00:00:00Z | Thursday | 8.0 | Yes |
1999-02-09T00:00:00Z | Tuesday | 6.0 | Yes |
1999-02-11T00:00:00Z | Thursday | 7.0 | No |
1999-02-15T00:00:00Z | Monday | 8.0 | Yes |
1999-02-19T00:00:00Z | Friday | 10.0 | Yes |
1999-03-03T00:00:00Z | Wednesday | 5.0 | No |
1999-03-04T00:00:00Z | Thursday | 9.0 | Yes |
1999-03-05T00:00:00Z | Friday | 8.0 | Yes |
1999-03-08T00:00:00Z | Monday | 11.0 | Yes |
1999-03-15T00:00:00Z | Monday | 8.0 | Yes |
1999-03-16T00:00:00Z | Tuesday | 8.0 | Yes |
1999-03-17T00:00:00Z | Wednesday | 6.5 | No |
1999-04-01T00:00:00Z | Thursday | 7.0 | Yes |
1999-04-12T00:00:00Z | Monday | 8.0 | Yes |
1999-04-20T00:00:00Z | Tuesday | 10.0 | Yes |
1999-04-27T00:00:00Z | Tuesday | 7.0 | Yes |
1999-05-05T00:00:00Z | Wednesday | 6.0 | No |
1999-05-12T00:00:00Z | Wednesday | 9.0 | Yes |
1999-05-13T00:00:00Z | Thursday | 7.0 | Yes |
1999-05-17T00:00:00Z | Monday | 7.5 | Yes |
1999-05-25T00:00:00Z | Tuesday | 8.0 | Yes |
1999-05-27T00:00:00Z | Thursday | 8.0 | No |
1999-06-01T00:00:00Z | Tuesday | 7.0 | Yes |
1999-06-07T00:00:00Z | Monday | 10.0 | Yes |
1999-06-11T00:00:00Z | Friday | 6.0 | No |
1999-06-15T00:00:00Z | Tuesday | 8.0 | Yes |
1999-06-24T00:00:00Z | Thursday | 7.5 | Yes |
1999-07-01T00:00:00Z | Thursday | 7.0 | No |
1999-07-06T00:00:00Z | Wednesday | 5.0 | Yes |
1999-07-12T00:00:00Z | Monday | 8.0 | Yes |
1999-07-20T00:00:00Z | Tuesday | 9.0 | Yes |
1999-07-21T00:00:00Z | Wednesday | 6.0 | No |
1999-07-23T00:00:00Z | Friday | 8.0 | Yes |
1999-08-03T00:00:00Z | Tuesday | 7.0 | Yes |
1999-08-05T00:00:00Z | Thursday | 6.0 | Yes |
1999-08-16T00:00:00Z | Monday | 8.0 | Yes |
1999-08-27T00:00:00Z | Friday | 4.0 | Yes |
1999-08-31T00:00:00Z | Tuesday | 7.0 | Yes |
1999-09-01T00:00:00Z | Wednesday | 11.0 | Yes |
1999-09-10T00:00:00Z | Friday | 7.0 | Yes |
1999-09-20T00:00:00Z | Monday | 7.0 | No |
1999-09-23T00:00:00Z | Thursday | 7.5 | Yes |
1999-09-28T00:00:00Z | Monday | 8.0 | Yes |
1999-10-01T00:00:00Z | Friday | 7.0 | Yes |
1999-10-04T00:00:00Z | Monday | 6.0 | Yes |
1999-10-05T00:00:00Z | Tuesday | 9.0 | Yes |
1999-10-14T00:00:00Z | Thursday | 8.0 | Yes |
1999-10-22T00:00:00Z | Friday | 7.0 | No |
1999-10-25T00:00:00Z | Monday | 7.0 | Yes |
1999-10-27T00:00:00Z | Wednesday | 8.0 | Yes |
1999-11-02T00:00:00Z | Tuesday | 7.0 | Yes |
1999-11-03T00:00:00Z | Wednesday | 8.0 | Yes |
1999-11-08T00:00:00Z | Monday | 5.0 | Yes |
1999-11-18T00:00:00Z | Thursday | 8.0 | No |
1999-12-01T00:00:00Z | Wednesday | 7.0 | Yes |
1999-12-03T00:00:00Z | Friday | 6.0 | No |
1999-12-06T00:00:00Z | Monday | 6.5 | Yes |
1999-12-13T00:00:00Z | Monday | 9.0 | Yes |
1999-12-16T00:00:00Z | Thursday | 7.0 | Yes |
1999-12-21T00:00:00Z | Tuesday | 8.0 | Yes |
From there the clean data is available for analyses that couldn’t have been performed in its original format.
refine_delete()
To clean up your OpenRefine workspace you can delete projects using refine_delete()
. Just like refine_export()
it’s possible to pass either a project name or unique identifier to this function. And it is only necessary to use both if there are multiple projects with the same name.
refine_delete(project.name = "lfm_cleanup")
https://docs.openrefine.org/technical-reference/openrefine-api#third-party-software-libraries↩︎
For documentation on how to specify a different host or port number see ?refine_path()
.↩︎
https://github.com/OpenRefine/OpenRefine/wiki/Changes-for-3.3#csrf-protection-changes↩︎
https://docs.openrefine.org/technical-reference/openrefine-api#apply-operations↩︎
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.