The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
The Zero-Inflated Poisson model is a model for count data with excess zeros. The response distribution is a mixture of a point mass at zero and a Poisson distribution: if \(Z\) is Bernoulli with probability \(1-p_0\) and \(P\) is Poisson with mean \(\lambda\), then \(Y=Z+(1-Z)P\) is zero-inflated Poisson. The ZIP is a latent-class model; we can have \(Y=0\) either because \(Z=0\) ('structural' zeroes) or because \(P=0\). That’s natural in some ecological examples: if you didn’t see any salmon it could be because the area is salmon-free (it’s Eden Park) or because you just randomly didn’t see any. To turn this into a regression model we typically put a logistic regression structure on \(Z\) and a Poisson regression structure on \(P\).
There isn’t (as far as I know) existing software in R for design-based inference in zero-inflated Poisson models, so it’s a good example for the benefits of svyVGAM
. The pscl
package (Zeileis et al) fits zero-inflated models, and so does VGAM
, so we can compare the model fitted with svyVGAM
to both of those and to other work-arounds.
I’ll do an example with data on number of sexual partners, from NHANES 2003-2004. We will look at questions SXQ200
and SXQ100
: number of male sexual partners. Combining these gives a ‘real’ zero-inflated variable: based on sexual orientation the zeroes would divide into 'never' and 'not yet'.
Here's how I created the dataset, from two NHANES files. It's data(nhanes_sxq)
in the package
library(foreign)
setwd("~/nhanes")
demo = read.xport("demo_c.xpt")
sxq = read.xport("sxq_c.xpt")
merged = merge(demo, sxq, by='SEQN')
merged$total = with(merged, ifelse(RIAGENDR==2, SXQ100+SXQ130, SXQ170+SXQ200))
merged$total[merged$SXQ020==2] = 0
merged$total[merged$total>2000] = NA
merged$age = merged$RIDAGEYR/25
merged$malepartners=with(merged, ifelse(RIAGENDR==2,SXQ100,SXQ200))
merged$malepartners[merged$malepartners>200]=NA
nhanes_sxq<-merged[, c("SDMVPSU","SDMVSTRA","WTINT2YR","RIDAGEYR","RIDRETH1","DMDEDUC","malepartners")]
Start off by loading the packages and setting up a survey design
library(svyVGAM)
## Loading required package: VGAM
## Loading required package: stats4
## Loading required package: splines
## Loading required package: survey
## Loading required package: grid
## Loading required package: Matrix
## Loading required package: survival
##
## Attaching package: 'survey'
## The following object is masked from 'package:VGAM':
##
## calibrate
## The following object is masked from 'package:graphics':
##
## dotchart
library(pscl)
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
data(nhanes_sxq)
des = svydesign(id=~SDMVPSU,strat=~SDMVSTRA,weights=~WTINT2YR, nest=TRUE, data=nhanes_sxq)
First, we'll fit the model just ignoring the survey design, using both pscl::zeroinfl
and VGAM::vglm
. These models use the same variables in a logistic regression for \(Z\) and a Poisson regression for \(P\). In VGAM
you would make the models different by constraining coefficients to be zero in one of the models; in pscl
you would specify different models before and after the |
.
unwt = zeroinfl(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC|RIDAGEYR+factor(RIDRETH1)+DMDEDUC, data=nhanes_sxq)
summary(unwt)
##
## Call:
## zeroinfl(formula = malepartners ~ RIDAGEYR + factor(RIDRETH1) + DMDEDUC |
## RIDAGEYR + factor(RIDRETH1) + DMDEDUC, data = nhanes_sxq)
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -1.0204 -0.9433 -0.7871 0.1503 29.2567
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.6666228 0.0506660 32.894 < 2e-16 ***
## RIDAGEYR -0.0055102 0.0008969 -6.143 8.08e-10 ***
## factor(RIDRETH1)2 -0.0394019 0.0779480 -0.505 0.613
## factor(RIDRETH1)3 0.6508821 0.0345734 18.826 < 2e-16 ***
## factor(RIDRETH1)4 0.6675311 0.0365963 18.240 < 2e-16 ***
## factor(RIDRETH1)5 0.5642954 0.0594928 9.485 < 2e-16 ***
## DMDEDUC 0.0094256 0.0135180 0.697 0.486
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.188125 0.187079 1.006 0.31461
## RIDAGEYR -0.002938 0.003629 -0.810 0.41823
## factor(RIDRETH1)2 -0.079636 0.242311 -0.329 0.74242
## factor(RIDRETH1)3 0.118369 0.116120 1.019 0.30803
## factor(RIDRETH1)4 0.143300 0.127764 1.122 0.26203
## factor(RIDRETH1)5 0.259515 0.223032 1.164 0.24460
## DMDEDUC -0.148881 0.053337 -2.791 0.00525 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 18
## Log-likelihood: -9518 on 14 Df
vglm(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC, zipoisson(), data = nhanes_sxq, crit = "coef")
##
## Call:
## vglm(formula = malepartners ~ RIDAGEYR + factor(RIDRETH1) + DMDEDUC,
## family = zipoisson(), data = nhanes_sxq, crit = "coef")
##
##
## Coefficients:
## (Intercept):1 (Intercept):2 RIDAGEYR:1 RIDAGEYR:2
## 0.188125349 1.666622759 -0.002937819 -0.005510160
## factor(RIDRETH1)2:1 factor(RIDRETH1)2:2 factor(RIDRETH1)3:1 factor(RIDRETH1)3:2
## -0.079635992 -0.039401949 0.118369301 0.650882145
## factor(RIDRETH1)4:1 factor(RIDRETH1)4:2 factor(RIDRETH1)5:1 factor(RIDRETH1)5:2
## 0.143300364 0.667531080 0.259515415 0.564295398
## DMDEDUC:1 DMDEDUC:2
## -0.148881313 0.009425589
##
## Degrees of Freedom: 5050 Total; 5036 Residual
## Log-likelihood: -9517.556
A traditional work-around for regression models is to rescale the weights to sum to the sample size and then pretend they are precision weights or frequency weights.
nhanes_sxq$scaledwt<-nhanes_sxq$WTINT2YR/mean(nhanes_sxq$WTINT2YR)
wt= zeroinfl(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC|RIDAGEYR+factor(RIDRETH1)+DMDEDUC, data=nhanes_sxq, weights=scaledwt)
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
summary(wt)
##
## Call:
## zeroinfl(formula = malepartners ~ RIDAGEYR + factor(RIDRETH1) + DMDEDUC |
## RIDAGEYR + factor(RIDRETH1) + DMDEDUC, data = nhanes_sxq, weights = scaledwt)
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -1.5864 -0.8418 -0.5430 0.1324 31.9106
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.8340681 0.0614994 29.823 < 2e-16 ***
## RIDAGEYR -0.0073881 0.0009059 -8.155 3.49e-16 ***
## factor(RIDRETH1)2 -0.1073312 0.0853535 -1.257 0.2086
## factor(RIDRETH1)3 0.6551367 0.0481679 13.601 < 2e-16 ***
## factor(RIDRETH1)4 0.6358148 0.0529173 12.015 < 2e-16 ***
## factor(RIDRETH1)5 0.4774124 0.0666607 7.162 7.96e-13 ***
## DMDEDUC -0.0237389 0.0143070 -1.659 0.0971 .
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.660504 0.214458 3.080 0.002071 **
## RIDAGEYR -0.007833 0.003673 -2.133 0.032959 *
## factor(RIDRETH1)2 -0.116789 0.252451 -0.463 0.643636
## factor(RIDRETH1)3 0.101971 0.151531 0.673 0.500987
## factor(RIDRETH1)4 -0.160804 0.181429 -0.886 0.375444
## factor(RIDRETH1)5 0.106779 0.230339 0.464 0.642954
## DMDEDUC -0.202397 0.057624 -3.512 0.000444 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 18
## Log-likelihood: -9766 on 14 Df
wtv= vglm(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC, zipoisson(), data = nhanes_sxq, crit = "coef",weights=scaledwt)
summary(wtv)
##
## Call:
## vglm(formula = malepartners ~ RIDAGEYR + factor(RIDRETH1) + DMDEDUC,
## family = zipoisson(), data = nhanes_sxq, weights = scaledwt,
## crit = "coef")
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept):1 0.6605042 0.2144354 3.080 0.002069 **
## (Intercept):2 1.8340681 0.0614568 29.843 < 2e-16 ***
## RIDAGEYR:1 -0.0078333 0.0036726 -2.133 0.032934 *
## RIDAGEYR:2 -0.0073881 0.0008995 -8.214 < 2e-16 ***
## factor(RIDRETH1)2:1 -0.1167894 0.2527278 -0.462 0.643999
## factor(RIDRETH1)2:2 -0.1073312 0.0847641 -1.266 0.205430
## factor(RIDRETH1)3:1 0.1019712 0.1515002 0.673 0.500899
## factor(RIDRETH1)3:2 0.6551367 0.0481359 13.610 < 2e-16 ***
## factor(RIDRETH1)4:1 -0.1608040 0.1814098 -0.886 0.375395
## factor(RIDRETH1)4:2 0.6358147 0.0529738 12.002 < 2e-16 ***
## factor(RIDRETH1)5:1 0.1067789 0.2303235 0.464 0.642931
## factor(RIDRETH1)5:2 0.4774124 0.0663590 7.194 6.27e-13 ***
## DMDEDUC:1 -0.2023967 0.0576221 -3.512 0.000444 ***
## DMDEDUC:2 -0.0237389 0.0146964 -1.615 0.106249
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Names of linear predictors: logitlink(pstr0), loglink(lambda)
##
## Log-likelihood: -9765.52 on 5036 degrees of freedom
##
## Number of Fisher scoring iterations: 8
##
## No Hauck-Donner effect found in any of the estimates
## repwts
repdes = as.svrepdesign(des,type="Fay",fay.rho=0.2)
rep1 = withReplicates(repdes, quote(
coef(zeroinfl(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC|RIDAGEYR+factor(RIDRETH1)+DMDEDUC, weights=.weights))
))
rep1
## theta SE
## count_(Intercept) 1.8335175 0.1350
## count_RIDAGEYR -0.0073709 0.0028
## count_factor(RIDRETH1)2 -0.1071380 0.2471
## count_factor(RIDRETH1)3 0.6552029 0.1858
## count_factor(RIDRETH1)4 0.6361156 0.1438
## count_factor(RIDRETH1)5 0.4769791 0.2501
## count_DMDEDUC -0.0238310 0.0797
## zero_(Intercept) 0.6606269 0.2582
## zero_RIDAGEYR -0.0078221 0.0039
## zero_factor(RIDRETH1)2 -0.1156275 0.2854
## zero_factor(RIDRETH1)3 0.1015741 0.0984
## zero_factor(RIDRETH1)4 -0.1620363 0.0859
## zero_factor(RIDRETH1)5 0.1065392 0.2120
## zero_DMDEDUC -0.2025776 0.0586
repv = withReplicates(repdes, quote(
coef(vglm(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC, zipoisson(), data = nhanes_sxq, crit = "coef",weights=.weights))
))
repv
## theta SE
## (Intercept):1 0.6605042 0.2582
## (Intercept):2 1.8340681 0.1350
## RIDAGEYR:1 -0.0078333 0.0039
## RIDAGEYR:2 -0.0073881 0.0028
## factor(RIDRETH1)2:1 -0.1167894 0.2854
## factor(RIDRETH1)2:2 -0.1073312 0.2471
## factor(RIDRETH1)3:1 0.1019712 0.0983
## factor(RIDRETH1)3:2 0.6551367 0.1857
## factor(RIDRETH1)4:1 -0.1608040 0.0859
## factor(RIDRETH1)4:2 0.6358147 0.1438
## factor(RIDRETH1)5:1 0.1067789 0.2120
## factor(RIDRETH1)5:2 0.4774124 0.2501
## DMDEDUC:1 -0.2023967 0.0586
## DMDEDUC:2 -0.0237389 0.0797
Another way to fit the model using just the survey
package is with svymle
. This takes the log-likelihood and its derivative as arguments, and adds linear predictors to some or all of those arguments. That is, we specify the log-likelihood in terms of the Bernoulli parameter \(p_0\) and the Poisson mean \(\lambda\) -- actually \(\mathrm{logit} p_0\) and \(\eta=\log\lambda\), and also give the derivative with respect to these two parameters. The software does the necessary additional work to put linear predictors on the parameters and give us the zero-inflated model. In fact, svymle
is very similar in underlying approach to vglm
; the difference is that vglm
comes with a large collection of predefined models.
In defining the loglikelihood I'm going to take advantage of the Poisson pmf being available in R. Let's call it \(\digamma(y,\lambda)\). The loglikelihood is \[\ell(y; \mu,p_0)=\log\left(p_0\{y==0\}+(1-p)\digamma(y,\mu)\right)\] only we want it in terms of \(\mathrm{logit} p_0\) and \(\eta=\log \lambda\)
loglike = function(y,eta,logitp){
mu = exp(eta)
p = exp(logitp)/(1+exp(logitp))
log(p*(y==0)+(1-p)*dpois(y,mu))
}
We also need the derivatives with respect to \(\mathrm{logit} p_0\) and \(\eta=\log \lambda\)
dlogitp = function(y,eta,logitp){
mu = exp(eta)
p = exp(logitp)/(1+exp(logitp))
dexpit = p/(1+p)^2
num = dexpit*(y==0)-dexpit*dpois(y,mu)
denom = p*(y==0)+(1-p)*dpois(y,mu)
num/denom
}
deta = function(y,eta,logitp){
mu = exp(eta)
p = exp(logitp)/(1+exp(logitp))
dmutoy = 0*y
dmutoy[y>0] = exp(-mu[y>0])*mu[y>0]^(y[y>0]-1)/factorial(y[y>0]-1)
num = (1-p)*(-dpois(y,mu)+dmutoy)
denom = p*(y==0)+(1-p)*dpois(y,mu)
num/denom
}
score = function(y,eta,logitp) cbind(deta(y,eta,logitp), dlogitp(y,eta,logitp))
And now we call svymle
giving the linear predictors for both parameters. One of the formulas needs to include the response variable \(Y\).
nlmfit = svymle(loglike=loglike, grad=score, design=des,
formulas=list(eta=malepartners~RIDAGEYR + factor(RIDRETH1) + DMDEDUC,
logitp=~RIDAGEYR + factor(RIDRETH1) + DMDEDUC),
start=coef(unwt), na.action="na.omit",method="BFGS")
summary(nlmfit)
## Survey-sampled mle:
## svymle(loglike = loglike, gradient = score, design = des, formulas = list(eta = malepartners ~
## RIDAGEYR + factor(RIDRETH1) + DMDEDUC, logitp = ~RIDAGEYR +
## factor(RIDRETH1) + DMDEDUC), start = coef(unwt), na.action = "na.omit",
## method = "BFGS")
## Coef SE p.value
## eta.(Intercept) 1.826825789 0.154214277 < 0.001
## eta.RIDAGEYR -0.007800690 0.003014997 0.00967
## eta.factor(RIDRETH1)2 -0.119694280 0.235192596 0.61081
## eta.factor(RIDRETH1)3 0.639831600 0.165176912 < 0.001
## eta.factor(RIDRETH1)4 0.615167292 0.117750580 < 0.001
## eta.factor(RIDRETH1)5 0.465555942 0.213462405 0.02919
## eta.DMDEDUC -0.008130865 0.072679440 0.91092
## logitp.(Intercept) 0.578310169 0.246782567 0.01911
## logitp.RIDAGEYR -0.006077533 0.004017016 0.13029
## logitp.factor(RIDRETH1)2 -0.033440316 0.280701007 0.90517
## logitp.factor(RIDRETH1)3 0.124435365 0.095140203 0.19090
## logitp.factor(RIDRETH1)4 -0.151762524 0.086322705 0.07873
## logitp.factor(RIDRETH1)5 0.119530077 0.209380275 0.56808
## logitp.DMDEDUC -0.209112828 0.053553191 < 0.001
## Stratified 1 - level Cluster Sampling design (with replacement)
## With (30) clusters.
## svydesign(id = ~SDMVPSU, strat = ~SDMVSTRA, weights = ~WTINT2YR,
## nest = TRUE, data = nhanes_sxq)
Finally, we use svy_vglm
, with variances by linearisation
library(svyVGAM)
svy_vglm(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC, zipoisson(), design=des, crit = "coef")
## Stratified 1 - level Cluster Sampling design (with replacement)
## With (30) clusters.
## svydesign(id = ~SDMVPSU, strat = ~SDMVSTRA, weights = ~WTINT2YR,
## nest = TRUE, data = nhanes_sxq)
##
## Call:
## vglm(formula = formula, family = family, data = surveydata, weights = .survey.prob.weights,
## crit = "coef")
##
##
## Coefficients:
## (Intercept):1 (Intercept):2 RIDAGEYR:1 RIDAGEYR:2
## 0.660504243 1.834068104 -0.007833317 -0.007388071
## factor(RIDRETH1)2:1 factor(RIDRETH1)2:2 factor(RIDRETH1)3:1 factor(RIDRETH1)3:2
## -0.116789371 -0.107331190 0.101971159 0.655136725
## factor(RIDRETH1)4:1 factor(RIDRETH1)4:2 factor(RIDRETH1)5:1 factor(RIDRETH1)5:2
## -0.160804047 0.635814748 0.106778915 0.477412443
## DMDEDUC:1 DMDEDUC:2
## -0.202396659 -0.023738881
##
## Degrees of Freedom: 5050 Total; 5036 Residual
## Log-likelihood: -493703966
and by replicate weights
svy_vglm(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC, zipoisson(), design=repdes, crit = "coef")
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.