This package contains R functions corresponding to useful Stata commands.
The package includes:
- panel data functions (monthly/quarterly dates, lead/lag, fillin)
- data.frame functions (tabulate, merge)
- vector functions (xtile, pctile, winsorize)
- graph functions (binscatter)
sum_up prints detailed summary statistics (corresponds to Stata summarize).
library(dplyr)
library(statar)
N <- 100
df <- tibble(
  id = 1:N,
  v1 = sample(5, N, TRUE),
  v2 = sample(1e6, N, TRUE)
)
sum_up(df)
df %>% sum_up(starts_with("v"), d = TRUE)
df %>% group_by(v1) %>% sum_up()
tab prints distinct rows with their count. Compared to the dplyr function count, this command adds frequency, percent, and cumulative percent.
N <- 1e2; K <- 10
df <- tibble(
  id = sample(c(NA, 1:5), N/K, TRUE),
  v1 = sample(1:5, N/K, TRUE)
)
tab(df, id)
tab(df, id, na.rm = TRUE)
tab(df, id, v1)
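To make the comparison with count concrete, here is a minimal sketch that builds the same kind of table with plain dplyr. The column names Percent and Cum are illustrative choices, not necessarily the labels that tab prints, and the toy data is made up for the example.

```r
library(dplyr)

# Toy data: one grouping column with repeated values
toy <- tibble(id = c(1, 1, 2, 3, 3, 3))

freq <- toy %>%
  count(id) %>%                      # n = number of rows per distinct id
  mutate(
    Percent = 100 * n / sum(n),      # share of each id in percent
    Cum = cumsum(Percent)            # running total of the shares
  )

print(freq)
```

The last value of the cumulative column is 100 by construction, which is a quick sanity check when reading such a table.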
join is a wrapper for dplyr merge functionalities, with the following added options:
The option check checks that there are no duplicates in the master or using datasets (as in Stata).
# merge m:1 v1
join(x, y, kind = "full", check = m~1)
The option gen specifies the name of a new variable that identifies unmatched and matched rows (as in Stata).
# merge m:1 v1, gen(_merge)
join(x, y, kind = "full", gen = "_merge")
The option update allows missing values in the master dataset to be updated with the corresponding values from the using dataset.
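The three options can be approximated with plain dplyr, which may help clarify what each one does. This is a sketch under stated assumptions, not how join is implemented: the tables x and y, the helper columns in_x and in_y, and the 1/2/3 coding of merge_status (borrowed from Stata's _merge convention) are all illustrative.

```r
library(dplyr)

# Hypothetical master (x) and using (y) tables sharing the key v1
x <- tibble(v1 = c(1, 2, 3), a = c(10, NA, 30))
y <- tibble(v1 = c(2, 3, 4), a = c(20, 99, 40))

# Analogue of check = m~1: the using table must have unique keys
stopifnot(anyDuplicated(y$v1) == 0)

merged <- full_join(
  x %>% mutate(in_x = TRUE),
  y %>% mutate(in_y = TRUE),
  by = "v1"
) %>%
  mutate(
    # Analogue of gen: 1 = master only, 2 = using only, 3 = matched
    merge_status = case_when(
      !is.na(in_x) & !is.na(in_y) ~ 3L,
      !is.na(in_x) ~ 1L,
      TRUE ~ 2L
    ),
    # Analogue of update: fill missing master values from the using table
    a = coalesce(a.x, a.y)
  ) %>%
  select(v1, a, merge_status)

print(merged)
```

Note that coalesce only fills values that are missing in the master: the row with v1 = 3 keeps the master value 30 rather than taking 99 from the using table.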
# pctile computes quantiles and weighted quantiles of type 2 (similarly to Stata _pctile)
v <- c(NA, 1:10)
pctile(v, probs = c(0.3, 0.7), na.rm = TRUE)
# xtile creates an integer variable for quantile categories (corresponds to Stata xtile)
v <- c(NA, 1:10)
xtile(v, n_quantiles = 3) # 3 groups based on terciles
xtile(v, probs = c(0.3, 0.7)) # 3 groups based on two quantiles
xtile(v, cutpoints = c(2, 3)) # 3 groups based on two cutpoints
# winsorize (default based on 5 x interquartile range)
v <- c(1:4, 99)
winsorize(v)
winsorize(v, replace = NA)
winsorize(v, probs = c(0.01, 0.99))
winsorize(v, cutpoints = c(1, 50))
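For readers who want to see the underlying operations, the following sketch reproduces the ideas with base R only. It is an approximation for illustration: how statar handles ties and values that fall exactly on a cutpoint may differ, and treating cutpoints as right-closed intervals is an assumption.

```r
# Type 2 quantiles (averaging at discontinuities) with base R
w <- c(NA, 1:10)
q <- quantile(w, probs = c(0.3, 0.7), type = 2, na.rm = TRUE)

# Categories from explicit cutpoints: (-Inf, 2], (2, 3], (3, Inf)
groups <- cut(w, breaks = c(-Inf, 2, 3, Inf), labels = FALSE)

# Winsorizing at explicit cutpoints: clamp values into [1, 50]
v <- c(1:4, 99)
clamped <- pmin(pmax(v, 1), 50)

print(q)
print(groups)
print(clamped)
```

Clamping with pmin and pmax keeps the vector length unchanged, which is the point of winsorizing as opposed to trimming.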
The classes “monthly” and “quarterly” print as dates and are compatible with usual time extraction (i.e., month, year, etc.). Yet, they are stored as integers representing the number of elapsed periods since 1970/01/01 (in months and quarters, respectively). This is particularly handy for simple algebra:
# elapsed dates
library(lubridate)
date <- mdy(c("04/03/1992", "01/04/1992", "03/15/1992"))
datem <- as.monthly(date)
# displays as a period
datem
#> [1] "1992m04" "1992m01" "1992m03"
# behaves as an integer for numerical operations:
datem + 1
#> [1] "1992m05" "1992m02" "1992m04"
# behaves as a date for period extractions:
year(datem)
#> [1] 1992 1992 1992
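The integer representation can be illustrated with base R alone. The helpers to_month_index and format_month_index below are hypothetical names written for this sketch, not part of statar, and the sketch assumes that January 1970 is period 0.

```r
# Hypothetical helper: months elapsed since January 1970
to_month_index <- function(d) {
  lt <- as.POSIXlt(d)
  # lt$year counts years since 1900; lt$mon runs from 0 (Jan) to 11 (Dec)
  as.integer((lt$year - 70L) * 12L + lt$mon)
}

# Hypothetical helper: print a month index in the "1992m04" style
format_month_index <- function(idx) {
  sprintf("%dm%02d", 1970L + idx %/% 12L, idx %% 12L + 1L)
}

d <- as.Date(c("1992-04-03", "1992-01-04", "1992-03-15"))
idx <- to_month_index(d)

print(idx)
print(format_month_index(idx))
print(format_month_index(idx + 1L))
```

Because the index is a plain integer, adding 1 moves every date forward by exactly one month regardless of the day of the month.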
tlag/tlead lag or lead a vector with respect to a number of periods, not with respect to the number of rows.
year <- c(1989, 1991, 1992)
value <- c(4.1, 4.5, 3.3)
tlag(value, 1, time = year)
library(lubridate)
date <- mdy(c("01/04/1992", "03/15/1992", "04/03/1992"))
datem <- as.monthly(date)
value <- c(4.1, 4.5, 3.3)
tlag(value, time = datem)
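The difference between lagging by periods and lagging by rows can be sketched in base R. The function lag_by_time is a hypothetical stand-in written for this illustration, not statar's implementation; it looks up, for each observation, the value recorded n periods earlier.

```r
# Hypothetical helper: value observed n periods before each time point
lag_by_time <- function(x, n = 1, time) {
  # match() returns NA when the earlier period is absent from the data
  x[match(time - n, time)]
}

year <- c(1989, 1991, 1992)
value <- c(4.1, 4.5, 3.3)

by_period <- lag_by_time(value, 1, time = year)  # respects the gap in 1990
by_row <- c(NA, head(value, -1))                 # ignores the gap

print(by_period)
print(by_row)
```

For 1991 the period-based lag is missing because 1990 is not observed, whereas the row-based lag silently returns the 1989 value.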
In contrast to comparable functions in zoo and xts, these functions can be applied to any vector and be used within a dplyr chain:
df <- tibble(
  id = c(1, 1, 1, 2, 2),
  year = c(1989, 1991, 1992, 1991, 1992),
  value = c(4.1, 4.5, 3.3, 3.2, 5.2)
)
df %>% group_by(id) %>% mutate(value_l = tlag(value, time = year))
is.panel checks whether a dataset is a panel, i.e., the time variable is never missing and the combinations (id, time) are unique.
df <- tibble(
  id1 = c(1, 1, 1, 2, 2),
  id2 = 1:5,
  year = c(1991, 1993, NA, 1992, 1992),
  value = c(4.1, 4.5, 3.3, 3.2, 5.2)
)
df %>% group_by(id1) %>% is.panel(year)
df1 <- df %>% filter(!is.na(year))
df1 %>% is.panel(year)
df1 %>% group_by(id1) %>% is.panel(year)
df1 %>% group_by(id1, id2) %>% is.panel(year)
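The two conditions in the definition can be written out directly in base R. The function is_panel_sketch is a hypothetical illustration of the check, not the package's code, and it takes column names as strings rather than a grouped data frame.

```r
# Hypothetical helper: TRUE when time is never missing and
# every (id, time) combination appears at most once
is_panel_sketch <- function(data, id, time) {
  no_missing_time <- !anyNA(data[[time]])
  unique_pairs <- anyDuplicated(data[, c(id, time)]) == 0
  no_missing_time && unique_pairs
}

panel_df <- data.frame(
  id1 = c(1, 1, 1, 2, 2),
  id2 = 1:5,
  year = c(1991, 1993, NA, 1992, 1992)
)
panel_df1 <- panel_df[!is.na(panel_df$year), ]

with_na <- is_panel_sketch(panel_df, "id1", "year")               # missing year
dup_pairs <- is_panel_sketch(panel_df1, "id1", "year")            # (2, 1992) twice
fine <- is_panel_sketch(panel_df1, c("id1", "id2"), "year")       # pairs unique

print(c(with_na, dup_pairs, fine))
```

Adding id2 to the identifier makes every combination unique, which is why only the last call succeeds.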
fill_gap transforms an unbalanced panel into a balanced panel. It corresponds to the Stata command tsfill. Missing observations are added as rows with missing values.
df <- tibble(
  id = c(1, 1, 1, 2),
  datem = as.monthly(mdy(c("04/03/1992", "01/04/1992", "03/15/1992", "05/11/1992"))),
  value = c(4.1, 4.5, 3.3, 3.2)
)
df %>% group_by(id) %>% fill_gap(datem)
df %>% group_by(id) %>% fill_gap(datem, full = TRUE)
df %>% group_by(id) %>% fill_gap(datem, roll = "nearest")
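The default behavior, filling gaps between each group's first and last period, can be sketched in base R. This is an illustration of the idea rather than statar's implementation: it uses a plain integer time column (month numbers) instead of the monthly class, and the names gaps, grid and balanced are made up for the example.

```r
# Toy unbalanced panel with integer months standing in for monthly dates
gaps <- data.frame(
  id = c(1, 1, 1, 2),
  time = c(4, 1, 3, 5),
  value = c(4.1, 4.5, 3.3, 3.2)
)

# For each id, list every period between its first and last observation
grid <- do.call(rbind, lapply(split(gaps, gaps$id), function(g) {
  data.frame(id = g$id[1], time = seq(min(g$time), max(g$time)))
}))

# Left-join the observed values; absent periods get NA
balanced <- merge(grid, gaps, by = c("id", "time"), all.x = TRUE)

print(balanced)
```

Here id 1 gains a row for period 2 with a missing value, while id 2, observed only once, is left unchanged.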
stat_binmean() (a stat for ggplot2) returns the mean of y and x within bins of x (20 bins by default). It’s a barebones version of the Stata command binscatter.
ggplot(iris, aes(x = Sepal.Width , y = Sepal.Length)) + stat_binmean()
# change number of bins
ggplot(iris, aes(x = Sepal.Width , y = Sepal.Length, color = Species)) + stat_binmean(n = 10)
# add regression line
ggplot(iris, aes(x = Sepal.Width , y = Sepal.Length, color = Species)) + stat_binmean() + stat_smooth(method = "lm", se = FALSE)
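The computation behind such a plot can be sketched in base R: split x into bins, then average x and y within each bin. The sketch assumes equal-sized bins built from ranks, with ties broken by order of appearance; stat_binmean may form its bins differently, so the numbers are illustrative only.

```r
n_bins <- 20
x <- iris$Sepal.Width
y <- iris$Sepal.Length

# Assign each observation to one of n_bins equal-sized groups by rank of x
bin <- ceiling(rank(x, ties.method = "first") * n_bins / length(x))

# One point per bin: the mean of x and the mean of y
binned <- data.frame(
  x = as.vector(tapply(x, bin, mean)),
  y = as.vector(tapply(y, bin, mean))
)

print(binned)
```

Plotting binned$y against binned$x gives one point per bin, which is what makes the relationship easier to read than a raw scatter plot.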
You can install:
- The latest released version from CRAN with
install.packages("statar")
- The current version from GitHub with
devtools::install_github("matthieugomez/statar")