sift facilitates intelligent & efficient exploration of datasets.
# install.packages("devtools")
devtools::install_github("sccmckenzie/sift")
sift is designed to work seamlessly with tidyverse.
library(tidyverse) # needed for below examples
library(sift)
sift::sift()
dplyr::filter() that includes neighboring observations.
Perhaps you remember the Utah monolith. The buzz surrounding its discovery (and disappearance) served as a welcome diversion from the otherwise upsetting twists and turns of 2020.
Suppose we are asked: what else was happening in the world around this time?
Let’s peruse the nyt2020 dataset to refresh our memory.
nyt2020 %>%
  filter(str_detect(headline, "Monolith")) %>%
  glimpse()
#> Rows: 1
#> Columns: 6
#> $ headline <chr> "Monolith Discovered in Utah Desert"
#> $ abstract <chr> "A metal monolith, planted firmly in the ground with no c~
#> $ byline <chr> "By Storyful"
#> $ pub_date <date> 2020-11-24
#> $ section_name <chr> "Science"
#> $ web_url <chr> "https://www.nytimes.com/video/science/earth/100000007471~
The monolith story broke on 2020-11-24. Prior to writing this documentation, I certainly would not have remembered this happening in November specifically.
Let’s take a peek at other headlines from ±2 days.
nyt2020 %>%
  filter(pub_date > "2020-11-22",
         pub_date < "2020-11-26") %>%
  select(headline, pub_date)
#> # A tibble: 15 x 2
#> headline pub_date
#> <chr> <date>
#> 1 Biden Has Chosen a Secretary of State 2020-11-23
#> 2 Pat Quinn, Who Promoted A.L.S. Ice Bucket Challenge, Dies at 37 2020-11-23
#> 3 Business Leaders, Citing Damage to Country, Urge Trump to Begin T~ 2020-11-23
#> 4 Pandemic Crowds Bring ‘Rivergeddon’ to Montana’s Rivers 2020-11-23
#> 5 No, Joe Biden did not have a maskless birthday party last week. 2020-11-23
#> 6 Monolith Discovered in Utah Desert 2020-11-24
#> 7 Coronavirus in N.Y.: Latest Updates 2020-11-24
#> 8 Two Darwin Notebooks, Missing for Decades, Were Most Likely Stolen 2020-11-24
#> 9 Recent Commercial Real Estate Transactions 2020-11-24
#> 10 Trump Administration Approves Start of Formal Transition to Biden 2020-11-24
#> 11 The C.D.C. is considering shortening its recommended quarantine p~ 2020-11-25
#> 12 A Poem of Gratitude From Nebraska 2020-11-25
#> 13 Casualties From Banned Cluster Bombs Nearly Doubled in 2019, Most~ 2020-11-25
#> 14 Iran Frees British-Australian Scholar in Prisoner Swap 2020-11-25
#> 15 A Poem of Gratitude From West Virginia 2020-11-25
Notice that it took two steps to achieve the above result. We first had to find the date of the monolith story, then perform a subsequent call to filter(). This procedure would quickly become a nuisance after a few iterations.
sift() provides an interface to perform this exact process in one step.
nyt2020 %>%
  sift(pub_date, scope = 2, str_detect(headline, "Monolith")) %>%
  select(headline, pub_date)
#> # A tibble: 15 x 2
#> headline pub_date
#> <chr> <date>
#> 1 Biden Has Chosen a Secretary of State 2020-11-23
#> 2 Pat Quinn, Who Promoted A.L.S. Ice Bucket Challenge, Dies at 37 2020-11-23
#> 3 Business Leaders, Citing Damage to Country, Urge Trump to Begin T~ 2020-11-23
#> 4 Pandemic Crowds Bring ‘Rivergeddon’ to Montana’s Rivers 2020-11-23
#> 5 No, Joe Biden did not have a maskless birthday party last week. 2020-11-23
#> 6 Monolith Discovered in Utah Desert 2020-11-24
#> 7 Coronavirus in N.Y.: Latest Updates 2020-11-24
#> 8 Two Darwin Notebooks, Missing for Decades, Were Most Likely Stolen 2020-11-24
#> 9 Recent Commercial Real Estate Transactions 2020-11-24
#> 10 Trump Administration Approves Start of Formal Transition to Biden 2020-11-24
#> 11 The C.D.C. is considering shortening its recommended quarantine p~ 2020-11-25
#> 12 A Poem of Gratitude From Nebraska 2020-11-25
#> 13 Casualties From Banned Cluster Bombs Nearly Doubled in 2019, Most~ 2020-11-25
#> 14 Iran Frees British-Australian Scholar in Prisoner Swap 2020-11-25
#> 15 A Poem of Gratitude From West Virginia 2020-11-25
Under the hood, sift() passes str_detect(headline, "Monolith") to dplyr::filter(), then augments the filtered observations to include any rows falling within the ±2 day window (specified by pub_date and scope = 2).
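The filter-then-expand idea can be sketched in base R on a toy data frame. This is only an illustration, not sift()'s actual implementation: the `news` data frame, its headlines and dates, and the strict-inequality window (chosen to mirror the manual filter() call earlier) are all assumptions.

```r
# Toy stand-in for nyt2020 (hypothetical headlines and dates)
news <- data.frame(
  headline = c("Story A", "Story B", "Monolith Found", "Story C", "Story D"),
  pub_date = as.Date(c("2020-11-20", "2020-11-23", "2020-11-24",
                       "2020-11-26", "2020-11-30")),
  stringsAsFactors = FALSE
)

scope <- 2

# Step 1: locate the dates of rows matching the filter condition
days     <- as.numeric(news$pub_date)
hit_days <- days[grepl("Monolith", news$headline)]

# Step 2: keep every row whose date lies strictly inside the window
# around at least one hit (boundary handling is an assumption here)
in_window <- vapply(
  days,
  function(d) any(abs(d - hit_days) < scope),
  logical(1)
)

result <- news[in_window, ]
result
```

Whether sift() itself treats the window boundary as inclusive or exclusive is not something this sketch settles; consult the package documentation for the exact rule.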
sift::break_join()
dplyr::left_join() & findInterval().
Take a look at the structure of us_uk_pop and us_uk_leaders below. How would you join these two datasets together? Specifically, we want each row in us_uk_pop to contain information (name, party) for the leader at that time.
us_uk_pop %>%
  group_by(country) %>%
  slice_head(n = 3)
#> # A tibble: 6 x 3
#> # Groups: country [2]
#> country date population
#> <chr> <date> <int>
#> 1 UK 1995-01-21 57997197
#> 2 UK 1996-01-19 58168519
#> 3 UK 1997-01-21 58346633
#> 4 USA 1995-01-20 268039654
#> 5 USA 1996-01-20 271231546
#> 6 USA 1997-01-19 274606475
us_uk_leaders
#> # A tibble: 11 x 4
#> country name start party
#> <chr> <chr> <date> <chr>
#> 1 USA Bush 1989-01-20 Republican
#> 2 USA Clinton 1993-01-20 Democratic
#> 3 USA Bush 2001-01-20 Republican
#> 4 USA Obama 2009-01-20 Democratic
#> 5 UK Thatcher 1979-05-04 Conservative
#> 6 UK Major 1990-11-28 Conservative
#> 7 UK Blair 1997-05-02 Labour
#> 8 UK Brown 2007-06-27 Labour
#> 9 UK Cameron 2010-05-11 Conservative
#> 10 UK May 2016-07-13 Conservative
#> 11 UK Johnson 2019-07-24 Conservative
If you look closely at the dates in us_uk_pop, they typically fall around January 20th (US inauguration day). Joining by country & year(date/start) would sweep this inconvenient detail under the rug.
For one country alone, we could use findInterval().
us_uk_pop %>%
  filter(country == "USA") %>%
  mutate(name = filter(us_uk_leaders, country == "USA")$name[findInterval(date, filter(us_uk_leaders, country == "USA")$start)])
#> # A tibble: 19 x 4
#> country date population name
#> <chr> <date> <int> <chr>
#> 1 USA 1995-01-20 268039654 Clinton
#> 2 USA 1996-01-20 271231546 Clinton
#> 3 USA 1997-01-19 274606475 Clinton
#> 4 USA 1998-01-20 278053607 Clinton
#> 5 USA 1999-01-20 281419130 Clinton
#> 6 USA 2000-01-19 284594395 Clinton
#> 7 USA 2001-01-18 287532638 Clinton
#> 8 USA 2002-01-19 290270187 Bush
#> 9 USA 2003-01-18 292883010 Bush
#> 10 USA 2004-01-17 295487267 Bush
#> 11 USA 2005-01-19 298165797 Bush
#> 12 USA 2006-01-19 300942917 Bush
#> 13 USA 2007-01-21 303786752 Bush
#> 14 USA 2008-01-21 306657153 Bush
#> 15 USA 2009-01-21 309491893 Obama
#> 16 USA 2010-01-18 312247116 Obama
#> 17 USA 2011-01-18 314911752 Obama
#> 18 USA 2012-01-19 317505266 Obama
#> 19 USA 2013-01-21 320050716 Obama
The above code is somewhat unintelligible. Additionally, there is no straightforward way to accommodate both UK & USA rows.
break_join() provides a simple interface that leverages the functionality of dplyr::left_join() and findInterval().
break_join(us_uk_pop, us_uk_leaders, brk = c("date" = "start"))
#> Joining, by = "country"
#> # A tibble: 38 x 5
#> country date population name party
#> <chr> <date> <int> <chr> <chr>
#> 1 USA 1995-01-20 268039654 Clinton Democratic
#> 2 USA 1996-01-20 271231546 Clinton Democratic
#> 3 USA 1997-01-19 274606475 Clinton Democratic
#> 4 USA 1998-01-20 278053607 Clinton Democratic
#> 5 USA 1999-01-20 281419130 Clinton Democratic
#> 6 USA 2000-01-19 284594395 Clinton Democratic
#> 7 USA 2001-01-18 287532638 Clinton Democratic
#> 8 USA 2002-01-19 290270187 Bush Republican
#> 9 USA 2003-01-18 292883010 Bush Republican
#> 10 USA 2004-01-17 295487267 Bush Republican
#> # ... with 28 more rows
Notice that country was detected as a common variable (courtesy of dplyr::left_join()). Alternatively, we could have supplied by explicitly.
# effectively the same as above call
break_join(us_uk_pop, us_uk_leaders, brk = c("date" = "start"), by = "country")
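The grouped interval lookup can be approximated in base R by running findInterval() once per country. The sketch below is not break_join()'s implementation; the `pop` and `leaders` data frames are small hypothetical excerpts modeled on us_uk_pop and us_uk_leaders.

```r
# Hypothetical excerpts of the two datasets
pop <- data.frame(
  country = c("USA", "USA", "UK", "UK"),
  date    = as.Date(c("2001-01-18", "2002-01-19", "1997-01-21", "1998-01-20")),
  stringsAsFactors = FALSE
)
leaders <- data.frame(
  country = c("USA", "USA", "UK", "UK"),
  name    = c("Clinton", "Bush", "Major", "Blair"),
  start   = as.Date(c("1993-01-20", "2001-01-20", "1990-11-28", "1997-05-02")),
  stringsAsFactors = FALSE
)

pop$name <- NA_character_
for (ctry in unique(pop$country)) {
  # Break points for this country, sorted as findInterval() requires
  terms <- leaders[leaders$country == ctry, ]
  terms <- terms[order(terms$start), ]

  rows <- pop$country == ctry
  idx  <- findInterval(as.numeric(pop$date[rows]), as.numeric(terms$start))

  # An index of 0 means the date precedes every start date: no match
  idx[idx == 0] <- NA
  pop$name[rows] <- terms$name[idx]
}

pop
```

Each population row picks up the leader whose start date is the latest one not after the row's date, within the same country.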
Additional arguments supplied to ... will be automatically directed to dplyr::left_join() and findInterval().
set.seed(1)
a <- tibble(x = 1:5, y = runif(5, 4, 6))
b <- tibble(y = c(4, 5), z = c("A", "B"))

break_join(a, b, brk = "y")
#> # A tibble: 5 x 3
#> x y z
#> <int> <dbl> <chr>
#> 1 5 4.40 A
#> 2 1 4.53 A
#> 3 2 4.74 A
#> 4 3 5.15 B
#> 5 4 5.82 B
break_join(a, b, brk = "y", all.inside = TRUE)
#> # A tibble: 5 x 3
#> x y z
#> <int> <dbl> <chr>
#> 1 5 4.40 A
#> 2 1 4.53 A
#> 3 2 4.74 A
#> 4 3 5.15 A
#> 5 4 5.82 A
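The difference between the two results follows from how findInterval() treats all.inside. A minimal base R sketch, with the y values typed in by hand (rounded from the output above) rather than regenerated with runif():

```r
y      <- c(4.40, 4.53, 4.74, 5.15, 5.82)  # rounded values from the output above
breaks <- c(4, 5)
z      <- c("A", "B")

# Default: values at or beyond the last break get the last index (2)
default_idx <- findInterval(y, breaks)

# all.inside = TRUE: indices are forced into 1..(N - 1), with N = length(breaks),
# so here every value maps to index 1
inside_idx <- findInterval(y, breaks, all.inside = TRUE)

z[default_idx]
z[inside_idx]
```

With only two break points, all.inside = TRUE leaves a single admissible interval, which is why every row is labeled A in the second call.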
sift::kluster()
Consider the faithful dataset.
A density plot of eruptions clearly shows two clusters.
Currently, these clusters are implicit, meaning we do not have a categorical variable associating each observation with a cluster. We could manually assign clusters by drawing a line at, say, 3.0.
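The manual approach might look like the following base R sketch. The 3.0 cutoff comes from the sentence above; the labels "short" and "long" are hypothetical names chosen for illustration.

```r
# faithful ships with base R (datasets package)
cutoff <- 3.0

# Assign each eruption to a cluster by comparing against the cutoff
cluster <- ifelse(faithful$eruptions < cutoff, "short", "long")

# Count observations per cluster
table(cluster)
```

The weakness of this approach is that the cutoff has to be picked by eye for every new variable.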
kluster() does this automatically, with no extra inputs needed.
k <- kluster(faithful$eruptions)