The `data` argument must be a data.frame arranged in “long” or “tidy” format [@Wickham2014]. Each row should be an alternative from a choice observation. The choice observations do not have to be symmetric (i.e. each choice observation could have a different number of alternatives). The data must also include variables for each of the following arguments in the `logitr()` function:

- `choiceName`: A dummy variable that identifies which alternative was chosen (`1` is chosen, `0` is not chosen). Only one alternative should be chosen per choice observation.
- `obsIDName`: A sequence of repeated numbers that identifies each unique choice observation. For example, if the first three choice observations had 2 alternatives each, then the first 6 rows of the `obsID` variable would be `1, 1, 2, 2, 3, 3`.
- `parNames`: The names of the variables that will be used as model covariates. For WTP space models, the price variable should not be included in `parNames` as it is provided separately with the `priceName` argument.

The {logitr} package contains several example data sets that illustrate this data structure. For example, in the `yogurt` data set, which contains observations of yogurt purchases by a panel of 100 households, each row is an alternative from a choice observation. Choice is identified by the `choice` column, the observation ID is identified by the `obsID` column, and the columns `price`, `feat`, and `brand` can be used as model covariates (`brand` is also broken out into additional dummy-coded columns):
head(yogurt)
#>   id obsID alt choice price feat   brand dannon hiland weight yoplait
#> 1  1     1   1      0   8.1    0  dannon      1      0      0       0
#> 2  1     1   2      0   6.1    0  hiland      0      1      0       0
#> 3  1     1   3      1   7.9    0  weight      0      0      1       0
#> 4  1     1   4      0  10.8    0 yoplait      0      0      0       1
#> 5  1     2   1      1   9.8    0  dannon      1      0      0       0
#> 6  1     2   2      0   6.4    0  hiland      0      1      0       0
This data set also has an `alt` variable that identifies the alternatives included in the choice set of each observation and an `id` variable that identifies the individual (this data set contains repeated observations from each individual).
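The structural requirements above (one chosen alternative per choice observation, choice sets that may differ in size) can be checked before fitting a model. The following is a minimal base R sketch on a made-up data frame; `toy` and its values are hypothetical and are not part of the {logitr} package.

```r
# Hypothetical toy data in long format: three choice observations
# with 2, 3, and 2 alternatives respectively (asymmetric choice sets)
toy <- data.frame(
  obsID  = c(1, 1, 2, 2, 2, 3, 3),
  choice = c(0, 1, 1, 0, 0, 0, 1),
  price  = c(8.1, 6.1, 7.9, 10.8, 9.8, 6.4, 5.3)
)

# Count the chosen alternatives within each choice observation
chosen_per_obs <- tapply(toy$choice, toy$obsID, sum)

# TRUE only if every choice observation has exactly one chosen alternative
all_valid <- all(chosen_per_obs == 1)

# Number of alternatives in each choice observation (these need not be equal)
alts_per_obs <- table(toy$obsID)
```

The same two checks could be run on a real data set by swapping `toy` for that data frame and using its own choice and observation ID columns.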
Variables are modeled as either continuous or discrete based on their data type. Numeric variables are by default estimated with a single “slope” coefficient. For example, consider a data frame that contains a `price` variable with the levels $10, $15, and $20. Adding `price` to the `parNames` argument in the main `logitr()` function would result in a single `price` coefficient for the “slope” of the change in price.
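To see why the data type matters, it can help to look at the design matrix that a variable produces. The sketch below uses base R's `model.matrix()` purely as an illustration; it is an assumption that this mirrors the general idea, and it is not a description of how `logitr()` builds its covariates internally. The data frame `df` is hypothetical.

```r
# Hypothetical data: price observed at three levels
df <- data.frame(price = c(10, 15, 20, 10, 15, 20))

# Stored as numeric, price enters as one column, so one "slope" coefficient
X_numeric <- model.matrix(~ price - 1, data = df)

# Stored as character, the same values are treated as categories:
# one dummy column per level except the first (the intercept is dropped here)
df$price_cat <- as.character(df$price)
X_categorical <- model.matrix(~ price_cat, data = df)[, -1, drop = FALSE]

ncol(X_numeric)      # number of columns for the numeric version
ncol(X_categorical)  # number of columns for the categorical version
```

Converting a column with `as.numeric()` or `as.character()` before modeling is therefore one way to control which treatment it receives.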
In contrast, categorical variables (i.e. `character` or `factor` type variables) are by default estimated with a coefficient for all but the first level, which serves as the reference level. The default reference level is determined alphabetically, but it can also be set by modifying the factor levels for that variable. For example, the default reference level for the `brand` variable is `"dannon"` as it is alphabetically first. To set `"weight"` as the reference level, the factor levels can be modified using the `factor()` function:
brands <- c("weight", "hiland", "yoplait", "dannon")
yogurt$brand <- factor(yogurt$brand, levels = brands)
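The effect of reordering factor levels can be inspected directly in base R. This is a small sketch on a hypothetical character vector `brand`, using `model.matrix()` with R's default treatment coding as a stand-in; whether `logitr()` uses exactly this coding internally is an assumption.

```r
# Hypothetical vector of brand labels
brand <- c("dannon", "hiland", "weight", "yoplait", "dannon")

# Default: levels are sorted alphabetically, so "dannon" comes first
default_levels <- levels(factor(brand))

# Custom order: "weight" is listed first and becomes the reference level
brand_f <- factor(brand, levels = c("weight", "hiland", "yoplait", "dannon"))

# Treatment (dummy) coding creates a column for every level except the first
dummies <- model.matrix(~ brand_f)
colnames(dummies)
```

No column is created for `"weight"`, so the remaining brand coefficients would be interpreted relative to it.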
If you wish to make dummy-coded variables yourself to use them in a model, I recommend using the `dummy_cols()` function from the {fastDummies} package. For example, in the code below, I create dummy-coded columns for the `brand` variable and then use those variables as covariates in a model:
yogurt <- fastDummies::dummy_cols(yogurt, "brand")
The `yogurt` data frame now has new dummy-coded columns for `brand` (it actually already had these, but now there are additional ones):
head(yogurt)
#> # A tibble: 6 x 15
#>      id obsID   alt choice price  feat brand   dannon hiland weight yoplait
#>   <dbl> <int> <int>  <dbl> <dbl> <dbl> <chr>    <dbl>  <dbl>  <dbl>   <dbl>
#> 1     1     1     1      0  8.1      0 dannon       1      0      0       0
#> 2     1     1     2      0  6.10     0 hiland       0      1      0       0
#> 3     1     1     3      1  7.90     0 weight       0      0      1       0
#> 4     1     1     4      0 10.8      0 yoplait      0      0      0       1
#> 5     1     2     1      1  9.80     0 dannon       1      0      0       0
#> 6     1     2     2      0  6.40     0 hiland       0      1      0       0
#> # … with 4 more variables: brand_dannon <int>, brand_hiland <int>,
#> # brand_weight <int>, brand_yoplait <int>
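For readers who prefer not to add a dependency, columns with the same `brand_<level>` naming pattern can also be built by hand. The loop below is a base R sketch on a hypothetical data frame `toy`; it is an alternative shown for illustration, not what `dummy_cols()` does internally.

```r
# Hypothetical data frame with one categorical column
toy <- data.frame(brand = c("dannon", "hiland", "weight", "yoplait"))

# Create one 0/1 integer column per brand, named "brand_<level>"
for (b in unique(toy$brand)) {
  toy[[paste0("brand_", b)]] <- as.integer(toy$brand == b)
}

names(toy)
```

When these columns are used as covariates, one level should be left out of `parNames` so that it serves as the reference level.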
Now I can use those columns as covariates:
mnl_pref_dummies <- logitr(
  data       = yogurt,
  choiceName = 'choice',
  obsIDName  = 'obsID',
  parNames   = c(
    'price', 'feat', 'brand_yoplait', 'brand_dannon', 'brand_weight')
)
Running Model...
Done!
summary(mnl_pref_dummies)
#> =================================================
#> MODEL SUMMARY:
#>
#> Model Space: Preference
#> Model Run: 1 of 1
#> Iterations: 18
#> Elapsed Time: 0h:0m:0.11s
#> Exit Status: 3
#> Weights Used?: FALSE
#>
#> Model Coefficients:
#>                Estimate StdError    tStat pVal signif
#> price         -0.366581 0.024366 -15.0447    0    ***
#> feat           0.491412 0.120063   4.0930    0    ***
#> brand_yoplait  4.450197 0.187118  23.7828    0    ***
#> brand_dannon   3.715575 0.145419  25.5508    0    ***
#> brand_weight   3.074399 0.145384  21.1467    0    ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Model Fit Values:
#>
#> Log-Likelihood: -2656.8878788
#> Null Log-Likelihood: -3343.7419990
#> AIC: 5323.7758000
#> BIC: 5352.7168000
#> McFadden R2: 0.2054148
#> Adj McFadden R2: 0.2039195
#> Number of Observations: 2412.0000000
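The fit values in this summary are related to one another through the log-likelihoods. As a worked check, the sketch below plugs the printed log-likelihoods, the number of estimated coefficients (5), and the number of observations (2412) into the textbook definitions of AIC, BIC, and McFadden's R2. That the package uses exactly these formulas is an assumption, although the results agree with the printed values to the displayed precision.

```r
# Values copied from the summary output above
ll      <- -2656.8878788   # log-likelihood of the fitted model
ll_null <- -3343.7419990   # null log-likelihood
k       <- 5               # number of estimated coefficients
n       <- 2412            # number of observations

# Textbook definitions (assumed, not taken from the package source)
aic         <- 2 * k - 2 * ll
bic         <- k * log(n) - 2 * ll
mcfadden    <- 1 - ll / ll_null
adjMcfadden <- 1 - (ll - k) / ll_null

round(c(AIC = aic, BIC = bic, McFadden = mcfadden, AdjMcFadden = adjMcfadden), 4)
```

The adjusted R2 is lower than the unadjusted one because it penalizes the log-likelihood by the number of estimated coefficients.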