---
title: "Getting Started with surveycore"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
bibliography: references.bib
link-citations: true
vignette: >
  %\VignetteIndexEntry{Getting Started with surveycore}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
library(surveycore)
knitr::opts_chunk$set(
  comment = "#>"
)
```

`surveycore` gives you a (mostly) complete workflow for survey data analysis.
This vignette gives a quick overview of the package's main functionality. It
comprises two main sections:

1. Creating survey objects

2. Conducting simple analysis

**Quick PSA before jumping in:**

`surveycore` was built as an alternative to `survey` and `srvyr`. However, the
code powering the variance estimation and analysis is vendored from the `survey`
package, and every estimate `surveycore` produces has been numerically verified
to match `survey` output. Without Thomas Lumley's work on that package,
`surveycore` would not be possible.

---

## Creating the survey object

The first step in survey analysis is creating a survey object that specifies
the sampling design, weights, and any other design information that's needed.
Without this information, point estimates may be biased and standard errors are
almost certainly wrong [@lumley2010; @lohr2022].

Fortunately, we don't have to calculate that uncertainty ourselves! That's what
the survey objects are for. They tell the analysis functions how to properly
account for the variance and bias introduced by the survey design.
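
To see why weighting matters, here's a toy base-R illustration (no `surveycore`
involved) of how design weights undo deliberate oversampling:

```{r toy-weights}
# Toy illustration: a population of 1,000 people, 300 in group A and 700 in
# group B, where the sample deliberately oversamples group A.
y <- c(rep(1, 50), rep(0, 50))                # 1 = group A, 0 = group B
w <- c(rep(300 / 50, 50), rep(700 / 50, 50))  # design weight = N_h / n_h

mean(y)             # unweighted mean: 0.5 -- distorted by the oversample
sum(w * y) / sum(w) # weighted mean: 0.3 -- the true population share
```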

`surveycore` has four different survey object constructors:

1. `as_survey()`

2. `as_survey_replicate()`

3. `as_survey_nonprob()`

4. `as_survey_twophase()`

Rather than going into detail, I'll give a quick overview of each constructor.
For more information, see `vignette("creating-survey-objects")`.

### `as_survey()` 

Use this when you have a probability sample whose data includes cluster IDs,
strata, and/or design weights. In this example we'll use the General Social
Survey, which has variables for clustering, strata, and design weights.

```{r as_survey}
gss_svy <- as_survey(
  gss_2024,
  ids = vpsu,
  strata = vstrat,
  weights = wtssps
)

gss_svy

```


### `as_survey_replicate()`

Use this when your data comes with pre-built replicate-weight columns like
`repwt_1`, `repwt_2`, and so on. For example, Pew's 2020 Jewish Americans study
uses replicate weights.


```{r replicate}
pew_jewish_svy <- as_survey_replicate(
  pew_jewish_2020,
  weights = extweight,
  repweights = extweight1:extweight100,
  type = "JK1"
)

pew_jewish_svy
```


### SRS designs with `as_survey()`

When your survey was a pure simple random sample — each respondent had equal
probability of selection, no clustering, no strata — use `as_survey()` without
specifying `ids` or `strata`. Omitting both creates a `survey_taylor` object
with no cluster or stratum structure, which is the SRS special case.

Here is a synthetic school district survey to illustrate:

```{r srs}
set.seed(101)
N <- 400 # total schools in district
n <- 80 # schools sampled

school_survey <- data.frame(
  school_id = sample(seq_len(N), n),
  avg_score = round(rnorm(n, mean = 72, sd = 11), 1),
  pct_frpl = round(runif(n, 0.10, 0.85), 2), # % free/reduced price lunch
  enrollment = round(runif(n, 180, 850)),
  sw = N / n, # equal sampling weight = 400/80 = 5.0
  fpc = N # population size for FPC
)

school_svy <- as_survey(
  school_survey,
  weights = sw, # each sampled school represents 5 schools in the population
  fpc = fpc # reduces SEs: we sampled 20% of the population
)

school_svy
```
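
The `fpc` column matters because, for an SRS drawn without replacement, the
usual variance estimate gets multiplied by the finite population correction
$1 - n/N$. A quick base-R sketch of the factor for the design above:

```{r fpc-factor}
# Finite population correction for the toy district survey above
N <- 400 # schools in the population
n <- 80  # schools sampled
1 - n / N # FPC factor = 0.8: variance shrinks 20% vs. with-replacement sampling
```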

### `as_survey_nonprob()`

Use this when your data comes from a non-probability panel (e.g., Qualtrics
Panels, Cint/Lucid, Dynata, YouGov, etc.) and you have created weights designed
to make the sample representative of the population you are interested in.

```{r calibrated}
ns_wave1_svy <- as_survey_nonprob(ns_wave1, weights = weight)

ns_wave1_svy
```

### `as_survey_twophase()`

Two-phase designs are rare, so it's unlikely you need this. But if you took a
large sample of a population and then resampled a subset of it, you might have
a two-phase design. Common contexts include case-cohort studies, medical
validation studies, and surveys with a screening phase.

We will use the `nwtco` data from the `survival` package.

```{r nwtco, eval=requireNamespace("survival", quietly=TRUE)}
nwtco <- survival::nwtco

# in.subcohort is stored as 0/1 — must be logical for as_survey_twophase()
nwtco$in.subcohort <- as.logical(nwtco$in.subcohort)

# Phase 1: all 4,028 enrolled patients (each patient is their own unit)
phase1 <- as_survey(nwtco, ids = seqno)

# Phase 2: subcohort, with Phase 2 sampling stratified by relapse status
nwtco_svy <- as_survey_twophase(
  phase1,
  strata2 = rel, # Phase 2 strata: cases (rel=1) vs. non-cases (rel=0)
  subset = in.subcohort, # Logical column: TRUE = selected into Phase 2
  method = "full"
)

nwtco_svy
```


As you may have noticed, each survey object has a print method that, similar to
a tibble, shows the first 10 rows of the data along with a brief description of
the survey design.

---

## Conducting Simple Analysis

In addition to creating survey objects, `surveycore` has several 
functions designed to make analysis easier: 

- `get_freqs()`

- `get_means()`

- `get_totals()`

- `get_corr()`

- `get_ratios()`

- `get_quantiles()`

### Frequency tables — `get_freqs()`

`get_freqs()` calculates weighted frequencies (aka proportions). The first
argument is the survey design; the second is the variable you want frequencies
for. Here's a simple example where we calculate the proportion of people willing
to consider voting for Trump.
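
Conceptually, a weighted proportion is the sum of weights in each category
divided by the total weight. A toy base-R sketch (made-up numbers, not
`surveycore` internals):

```{r freqs-sketch}
w <- c(1, 2, 1, 4)                 # toy weights
resp <- c("yes", "no", "yes", "no") # toy responses
prop.table(tapply(w, resp, sum))    # weighted shares: no = 0.75, yes = 0.25
```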

```{r freqs-basic}

get_freqs(ns_wave1_svy, consider_trump)

```


**Analyzing multiple variables at once**

A key piece of survey research involves select-all-that-apply style questions. 
For example, the Nationscape data asked people: "We're interested in where you
might have heard news about politics in the last week. Please indicate which of 
the following sources you used." Rather than looking at each one individually,
`get_freqs()` allows you to pass in multiple variables (it uses `tidy-select`
under the hood to do this). Let's look at an example:

```{r freqs-multi}
get_freqs(ns_wave1_svy, c(news_sources_facebook:news_sources_other))

```

The `name` column identifies which variable each row belongs to; `value`
holds the response code. You can also change the name of the columns 
if you want. For example:

```{r}
ns_wave1_svy |>
  get_freqs(
    c(news_sources_facebook:news_sources_other),
    names_to = "news_source",
    values_to = "choice"
  )

```


### Weighted means — `get_means()`

`get_means()` estimates the survey-weighted mean of a continuous variable.

```{r means-basic}
# Mean discrimination against blacks
get_means(ns_wave1_svy, discrimination_blacks)
```



### Population totals — `get_totals()`

`get_totals()` estimates the weighted sum: the total count or aggregate for
the target population. It's important to note that the weights typically used in
non-probability samples are calibration weights, which are designed to make the
sample representative of its target population. These weights will not give you
accurate population totals. Design weights scaled to the target population,
however, will give you an estimated population size.

To show the difference, we'll use Pew's 2020 Jewish Americans study, whose
weights recover population totals, and the Nationscape data, whose calibration
weights do not.

First, we will look at the Nationscape data. Since these are calibration weights,
the totals will add up to the number of rows (6,422). 

```{r}

get_totals(ns_wave1_svy)

```

However, the Pew Jewish-Americans study shows us the estimated total population
of Jews in the US:

```{r}

get_totals(pew_jewish_svy)

```

Note how we did not specify any variables. That's because we are interested in
the estimated size of the target population. If we instead want the estimated
total of a variable such as income or age, we can specify it in the `x`
argument.
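
Conceptually, a weighted total of a variable is just the weight-weighted sum. A
toy base-R sketch (made-up numbers, not `surveycore` internals):

```{r totals-sketch}
w <- c(2, 3, 5)            # toy design weights
income <- c(100, 200, 300) # toy variable
sum(w * income)            # estimated population total = 2300
```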

If we want the estimated number of people who selected each response option, we
use the `group` argument. For example, to estimate how many Jews are in each age
category, we would calculate:

```{r}
get_totals(pew_jewish_svy, group = age4cat)
```

---

### Weighted correlations — `get_corr()`

`get_corr()` estimates survey-weighted Pearson correlations between two or
more continuous variables. Confidence intervals use the Fisher Z
transformation, guaranteeing bounds in (−1, 1).
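
The transformation itself is simple to sketch in base R (the estimate and SE
below are made up for illustration; this is not `surveycore`'s internal code):

```{r fisher-z-sketch}
r  <- 0.62 # hypothetical correlation estimate
se <- 0.05 # hypothetical standard error of r

z    <- atanh(r)                        # Fisher Z transform
half <- qnorm(0.975) * se / (1 - r^2)   # delta-method half-width on the z scale
tanh(c(z - half, z + half))             # back-transform: bounds stay in (-1, 1)
```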

Let's look at approval for Trump and Biden. First we clean the underlying
data frame — dropping rows with missing values and removing "Not sure"
responses (coded `999`) — then rebuild the survey object.

```{r corr-basic}
ns_wave1_clean <- ns_wave1 |>
  dplyr::filter(
    !is.na(cand_favorability_trump),
    !is.na(cand_favorability_biden),
    cand_favorability_trump != 999,
    cand_favorability_biden != 999
  )

ns_wave1_clean_svy <- as_survey_nonprob(ns_wave1_clean, weights = weight)

get_corr(
  ns_wave1_clean_svy,
  c(cand_favorability_trump, cand_favorability_biden)
)
```

Next, let's look at favorability across multiple variables.

```{r corr-multi}
fav_vars <- c(
  "cand_favorability_trump", "cand_favorability_biden",
  "cand_favorability_harris", "cand_favorability_sanders",
  "cand_favorability_warren", "cand_favorability_buttigieg",
  "cand_favorability_pence"
)

ns_wave1_multi <- ns_wave1 |>
  dplyr::filter(
    dplyr::if_all(dplyr::all_of(fav_vars), ~ !is.na(.x) & .x != 999)
  )

ns_wave1_multi_svy <- as_survey_nonprob(ns_wave1_multi, weights = weight)

get_corr(
  ns_wave1_multi_svy,
  c(cand_favorability_trump:cand_favorability_pence)
)
```

The output defaults to a long format where each row is a unique variable pair.
It shows the correlation in `r`, the confidence interval, the p-value, and other
relevant information.

Switch to wide format for a more familiar correlation-matrix layout:

```{r corr-wide}
get_corr(
  ns_wave1_multi_svy,
  c(cand_favorability_trump:cand_favorability_pence),
  format = "wide"
)
```


---

### Ratio estimation — `get_ratios()`

`get_ratios()` estimates the ratio of two weighted totals:

$$\hat{R} = \frac{\sum_i w_i \, y_i}{\sum_i w_i \, x_i}$$

Variance is estimated via the delta method (linearization), equivalent to
`survey::svyratio()` [@lumley2010].

Ratios are useful when you want an estimate that is invariant to the scale of
the weights — for example, wages per hour, spending per household member, or
disease prevalence ratios.
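
The point estimate of $\hat{R}$ is easy to verify by hand. A toy base-R sketch
(made-up numbers, not `surveycore` internals):

```{r ratio-sketch}
w <- c(2, 1, 3)          # toy weights
y <- c(10, 20, 30)       # numerator variable
x <- c(5, 10, 15)        # denominator variable
sum(w * y) / sum(w * x)  # R-hat = 130 / 65 = 2
```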

A good example uses the Nationscape data. The favorability scale runs from 1
(Very favorable) to 4 (Very unfavorable), so we can estimate the ratio of
Trump's aggregate favorability score to Biden's. A ratio less than 1 means
Trump's aggregate score is lower — i.e., he is viewed more favorably on
average; a ratio greater than 1 means Biden is viewed more favorably.

```{r ratios-basic}
get_ratios(
  ns_wave1_clean_svy,
  numerator = cand_favorability_trump,
  denominator = cand_favorability_biden
)
```

---

### `get_quantiles()`

`get_quantiles()` estimates survey-weighted quantiles using the Woodruff (1952)
confidence interval method. Confidence intervals are derived by inverting the
weighted CDF rather than assuming normality, so they are generally **asymmetric**
around the estimate and always respect the range of the data. By default, it
calculates quantiles at the 25th, 50th, and 75th percentiles.
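
Inverting the weighted CDF is easy to sketch for a weighted median (base R,
illustration only; `surveycore` also handles interpolation and the Woodruff CI
for you):

```{r weighted-median-sketch}
y <- c(3, 1, 4, 1, 5) # toy values
w <- c(1, 2, 1, 1, 2) # toy weights
o <- order(y)
cdf <- cumsum(w[o]) / sum(w) # weighted empirical CDF over sorted values
y[o][which(cdf >= 0.5)[1]]   # smallest y where the CDF reaches 0.5 -> 3
```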

```{r quantiles-basic}
# Quartiles and median of age (default probs = c(0.25, 0.5, 0.75))
get_quantiles(ns_wave1_svy, age)
```

#### Choosing quantiles

Pass any numeric vector to `probs`. For the median alone:

```{r quantiles-median}
get_quantiles(ns_wave1_svy, age, probs = 0.5)
```

For deciles of age:

```{r quantiles-deciles}
get_quantiles(ns_wave1_svy, age, probs = seq(0.1, 0.9, 0.1))
```




### Subgroup analysis — the `group` argument

Every analysis function accepts a `group` argument for computing estimates
separately within levels of a categorical variable. Pass a bare column name
or multiple using `c()`. For example, we'll look at Trump consideration
by party identification.

```{r group-means}
get_freqs(ns_wave1_svy, consider_trump, group = pid3)
```

Rows where the grouping variable is `NA` are excluded from all groups and do
not appear in the output. Responses within each group sum to 100% for
`get_freqs()`.

---

### Controlling uncertainty output

All analysis functions share a common `variance` argument. You can request
any combination of:

| Code   | What it returns                                     |
|--------|-----------------------------------------------------|
| `"se"` | Standard error                                      |
| `"ci"` | Confidence interval: `ci_low`, `ci_high`            |
| `"var"`| Variance (square of the SE)                         |
| `"cv"` | Coefficient of variation (SE / estimate)            |
| `"moe"`| Margin of error at `conf_level`                     |
| `"deff"` | Design effect (complex design variance / SRS variance) |

The `conf_level` argument controls the confidence level for `"ci"` and
`"moe"`. The default is `0.95`; for a 90% CI:
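
For a normal-based margin of error, the MOE is just the critical value times the
SE. A base-R sketch at a 90% level (made-up SE; illustration only, not
`surveycore` internals):

```{r moe-sketch}
se <- 1.2                 # hypothetical standard error
qnorm(1 - 0.10 / 2) * se  # 90% MOE = z * SE, approximately 1.97
```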

```{r variance-options}
get_means(
  ns_wave1_svy,
  age,
  variance = c("se", "ci", "moe"),
  conf_level = 0.9
)

```

Set `variance = NULL` to suppress all uncertainty columns and return point
estimates and sample counts only.

### Estimated population counts

Add `n_weighted = TRUE` to any function to include the estimated population
count — the sum of weights — alongside the unweighted sample count `n`. Let's
look at Pew's Jewish Americans data again. Using `get_freqs()` with
`n_weighted = TRUE`, we can see both proportions and estimated populations. In
this example, we're looking at the proportion (`pct`), the sample count (`n`),
and the estimated population size (`n_weighted`) of Jewish Americans in each
age category.

```{r n-weighted}
get_freqs(pew_jewish_svy, age4cat, n_weighted = TRUE)

```
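
Since `n_weighted` is just the sum of weights, the per-group version is a
one-line base-R computation (toy data, illustration only):

```{r n-weighted-sketch}
w <- c(1.5, 2.0, 0.5, 1.0) # toy weights
g <- c("a", "a", "b", "b") # toy groups
tapply(w, g, sum)          # weighted counts per group: a = 3.5, b = 1.5
```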


---

## Summary

| Function | Use for |
|---|---|
| `get_freqs()` | Categorical variables — weighted distributions, percentages |
| `get_means()` | Continuous variables — weighted means |
| `get_totals()` | Population counts or aggregates — weighted sums |
| `get_ratios()` | Ratios of two weighted totals |
| `get_corr()` | Pairwise Pearson correlations |
| `get_quantiles()` | Weighted quantiles and median — Woodruff CIs |

All functions:

- Return a tibble subclass ready for further analysis or display
- Accept a `group` argument for subgroup estimates
- Accept a `variance` argument to control which uncertainty columns appear
- Handle all survey design classes: `survey_taylor`, `survey_replicate`,
  `survey_twophase`, and `survey_nonprob`


