Model Specification Language
Contents
Graphical models
Graphs as a formal language
The BUGS language: stochastic nodes
Censoring and truncation
Constraints on using certain distributions
Logical nodes
Arrays and indexing
Repeated structures
Data transformations
Nested indexing and mixtures
Formatting of data
Graphical models
We strongly recommend that the first step in any analysis should be the construction of a directed graphical model. Briefly, this represents all quantities as nodes in a directed graph, in which arrows run into nodes from their direct influences (parents). The model represents the assumption that, given its parent nodes pa[v], each node v is independent of all other nodes in the graph except descendants of v, where descendant has the obvious definition.
Nodes in the graph are of three types.
1. Constants are fixed by the design of the study: they are always founder nodes (i.e. do not have parents), and are denoted as rectangles in the graph. They must be specified in a data file.
2. Stochastic nodes are variables that are given a distribution, and are denoted as ellipses in the graph; they may be parents or children (or both). Stochastic nodes may be observed, in which case they are data, or may be unobserved and hence be parameters, which may be unknown quantities underlying a model, observations on an individual case that are unobserved (say, due to censoring), or simply missing data.
3. Deterministic nodes are logical functions of other nodes.
Quantities are specified to be data by giving them values in a data file, in which values for constants are also given.
Directed links may be of two types: a solid arrow indicates a stochastic dependence while a hollow arrow indicates a logical function. An undirected dashed link may also be drawn to represent an upper or lower bound.
Repeated parts of the graph can be represented using a 'plate', as shown below for the range (i in 1:N).
![[model specification1]](model specification1.bmp)
A simple graphical model, where Y[i] depends on mu[i] and tau, with mu[i] being a logical function of alpha and beta.
The conditional independence assumptions represented by the graph mean that the full joint distribution of all quantities V has a simple factorisation in terms of the conditional distribution p(v | parents[v]) of each node given its parents, so that

$$p(V) = \prod_{v \in V} p(v \mid \text{parents}[v])$$
The crucial idea is that we need only provide the parent-child distributions in order to fully specify the model, and WinBUGS then sorts out the necessary sampling methods directly from the expressed graphical structure.
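As a minimal sketch of this factorisation, the Python below (not part of WinBUGS) sums the log of each parent-child term for the stochastic nodes Y[i] in the simple model pictured above. The linear form mu[i] = alpha + beta * x[i] and all helper names are assumptions made for illustration; terms for alpha, beta and tau would be added to the same sum once priors were chosen.

```python
import math


def normal_logpdf(y, mean, precision):
    """Log density of a normal in the mean/precision parameterisation."""
    return 0.5 * math.log(precision / (2.0 * math.pi)) - 0.5 * precision * (y - mean) ** 2


def log_density_of_data(y, x, alpha, beta, tau):
    """Sum of log p(Y[i] | parents[Y[i]]) over all i."""
    total = 0.0
    for y_i, x_i in zip(y, x):
        # Logical node: a function of its parents, so it adds no density term.
        mu_i = alpha + beta * x_i  # assumed linear form
        total += normal_logpdf(y_i, mu_i, tau)
    return total


# Hypothetical values, chosen so that every residual is zero.
y = [1.0, 2.0]
x = [0.0, 1.0]
value = log_density_of_data(y, x, alpha=1.0, beta=1.0, tau=1.0)
```

Because the joint density is a product over nodes, its log is a plain sum, which is why each node can be handled using only its own parents.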
Graphs as a formal language
A special drawing tool, DoodleBUGS, has been developed for specifying graphical models, which uses a hyper-diagram approach to add extra information to the graph to give a complete model specification. Each stochastic and logical node in the graph must be given a name using the conventions explained in Creating a node.
![[model specification2]](model specification2.bmp)
The shaded node Y[i] is normally distributed with mean mu[i] and precision tau.
The shaded node mu[i] is a logical function of alpha, beta, and the constants x (x is not required to be shown in the graph).
The value function of a logical node contains all the necessary information to define the logical node: the logical links in the graph are not strictly necessary.
As an alternative to the Doodle representation, the model can be specified using the text-based BUGS language, headed by the model statement:
model {
text-based description of graph in BUGS language
}
The BUGS language: stochastic nodes
In the text-based model description, stochastic nodes are represented by the node name, followed by a tilde ('twiddles') symbol, followed by the distribution name, followed by a comma-separated list of parents enclosed in round brackets, e.g.
r ~ dbin(p, n)
The distributions that can be used in WinBUGS are described in Distributions. Clicking on the name of each distribution should provide a link to an example of its use provided with this release. The parameters of a distribution must be explicit nodes in the graph (scalar parameters can also be numerical constants) and so may not be function expressions.
For distributions not featured in Distributions, see Tricks: Advanced Use of the BUGS Language.
Censoring and truncation
Censoring is denoted using the notation I(lower, upper), e.g.
x ~ ddist(theta)I(lower, upper)
would denote a quantity x from distribution ddist with parameters theta, which had been observed to lie between lower and upper. Leaving either lower or upper blank corresponds to no limit, e.g. I(lower,) corresponds to an observation known to lie above lower. Whenever censoring is specified the censored node contributes a term to the full conditional distribution of its parents. This structure is only of use if x has not been observed (if x is observed then the constraints will be ignored).
It is vital to note that this construct does NOT correspond to a truncated distribution, which generates a likelihood that is a complex function of the basic parameters. Truncated distributions might be handled by working out an algebraic form for the likelihood and using the techniques for arbitrary distributions described in Tricks: Advanced Use of the BUGS Language.
It is also important to note that if x, theta, lower and upper are all unobserved, then lower and upper must not be functions of theta.
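The difference between censoring and truncation noted above can be made concrete with a short Python sketch (not WinBUGS code). In standard statistical terms, an unobserved value known only to lie in an interval contributes the probability of that interval, whereas a value drawn from a truncated distribution has its density renormalised by that same probability. The normal distribution, the mean/standard-deviation parameterisation and the function names are assumptions for illustration only; the sketch does not describe how WinBUGS samples internally.

```python
import math


def normal_cdf(x, mean, sd):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def normal_pdf(x, mean, sd):
    """Density of a normal distribution."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def censored_likelihood(lower, upper, mean, sd):
    """Likelihood term for an unobserved x known to lie in (lower, upper)."""
    return normal_cdf(upper, mean, sd) - normal_cdf(lower, mean, sd)


def truncated_density(x, lower, upper, mean, sd):
    """Density of x when the distribution itself is restricted to (lower, upper)."""
    if not lower < x < upper:
        return 0.0
    mass = normal_cdf(upper, mean, sd) - normal_cdf(lower, mean, sd)
    return normal_pdf(x, mean, sd) / mass


p_censored = censored_likelihood(-1.96, 1.96, 0.0, 1.0)
d_truncated = truncated_density(0.0, -1.0, 1.0, 0.0, 1.0)
```

The normalising term in the truncated case depends on the parameters, which is what makes the truncated likelihood a more complex function of them.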
Constraints on using certain distributions
Contiguous elements: Multivariate nodes must form contiguous elements in an array. Since the final index of an array changes fastest, such nodes must be defined over the final indices of the array. For example, to define a set of K * K Wishart variables as a single multidimensional array x[i,j,k], we could write:
for (i in 1:I) {
x[i, 1:K, 1:K] ~ dwish(R[i,,], 3)
}
where R[i,,] is an array of specified prior parameters.
No missing data: Data defined as multinomial or as multivariate Student-t must be complete, in that missing values are not allowed in the data array. We realise this is an unfortunate restriction and we hope to relax it in the future. For multinomial data, it may be possible to get round this problem by re-expressing the multivariate likelihood as a sequence of conditional univariate binomial distributions.
Note that multivariate normal data may now be specified with missing values.
Conjugate updating: Dirichlet and Wishart distributions can only be used as parents of multinomial and multivariate normal nodes respectively.
Parameters you can't learn about and must specify as constants: The parameters of Wishart distributions and the order (N) of the multinomial distribution must be specified and cannot be given prior distributions.
Structured precision matrices for multivariate normals: These can be used in certain circumstances. If a Wishart prior is not used for the precision matrix of a multivariate normal node, then the elements of the precision matrix are updated univariately without any check of positive-definiteness. This will result in a crash unless the precision matrix is parameterised appropriately. This is the user's responsibility!
Non-integer data for Poisson and binomial: Previously only integer-valued data were allowed with Poisson and binomial distributions - this restriction has now been lifted. More generally, it is now possible to specify a Poisson prior for any continuous quantity.
Range constraints - using the I(.,.) notation - cannot be used with multivariate nodes, except for multivariate normal distributions, in which case the arguments to the I(.,.) function may be specified as 'blanks' or as vector-valued bounds.
Logical nodes
Logical nodes are represented by the node name, followed by a left-pointing arrow, followed by a logical expression of its parent nodes, e.g.
mu[i] <- beta0 + beta1 * z1[i] + beta2 * z2[i] + b[i]
Logical expressions can be built using the following operators: plus, multiplication, minus, division and unary minus. The functions in Table I below can also be used in logical expressions.
In Table I, function arguments represented by e can be expressions, those by s must be scalar-valued nodes in the graph and those represented by v must be vector-valued nodes in a graph.
Table I: Functions

Function                 Value
abs(e)                   $|e|$
cloglog(e)               ln(-ln(1 - e))
cos(e)                   cosine(e)
cut(e)                   cuts edges in the graph - see Use of the "cut" function
equals(e1, e2)           1 if e1 = e2; 0 otherwise
exp(e)                   exp(e)
inprod(v1, v2)           $\sum_i v_{1,i} v_{2,i}$
interp.lin(e, v1, v2)    $v_{2,p} + (v_{2,p+1} - v_{2,p}) (e - v_{1,p}) / (v_{1,p+1} - v_{1,p})$,
                         where the elements of v1 are in ascending order
                         and p is such that $v_{1,p} < e < v_{1,p+1}$
inverse(v)               $v^{-1}$ for symmetric positive-definite matrix v
log(e)                   ln(e)
logdet(v)                ln(det(v)) for symmetric positive-definite v
logfact(e)               ln(e!)
loggam(e)                $\ln(\Gamma(e))$
logit(e)                 ln(e / (1 - e))
max(e1, e2)              e1 if e1 > e2; e2 otherwise
mean(v)                  $n^{-1} \sum_i v_i$, where n = dim(v)
min(e1, e2)              e1 if e1 < e2; e2 otherwise
phi(e)                   standard normal cdf
pow(e1, e2)              $e_1^{e_2}$
sin(e)                   sine(e)
sqrt(e)                  $e^{1/2}$
rank(v, s)               number of components of v less than or equal to $v_s$
ranked(v, s)             the s-th smallest component of v
round(e)                 nearest integer to e
sd(v)                    standard deviation of components of v (n - 1 in denominator)
step(e)                  1 if e >= 0; 0 otherwise
sum(v)                   $\sum_i v_i$
trunc(e)                 greatest integer less than or equal to e
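To make a few of the less familiar entries in Table I concrete, the following Python sketch (an illustration, not WinBUGS code) mirrors the definitions of step, rank, ranked and interp.lin. The function names and 1-based handling of s are assumptions chosen to match the table; the treatment of values exactly equal to a knot, and the error raised outside the range of v1, are also assumptions, since the table only defines the strictly interior case.

```python
def step(e):
    """1 if e >= 0; 0 otherwise."""
    return 1 if e >= 0 else 0


def rank(v, s):
    """Number of components of v less than or equal to v[s] (s counted from 1)."""
    return sum(1 for value in v if value <= v[s - 1])


def ranked(v, s):
    """The s-th smallest component of v (s counted from 1)."""
    return sorted(v)[s - 1]


def interp_lin(e, v1, v2):
    """Linear interpolation of v2 against ascending v1, evaluated at e."""
    for p in range(len(v1) - 1):
        if v1[p] <= e <= v1[p + 1]:  # assumption: knots are included
            slope = (v2[p + 1] - v2[p]) / (v1[p + 1] - v1[p])
            return v2[p] + slope * (e - v1[p])
    raise ValueError("e lies outside the range of v1")  # assumed behaviour


v = [3.0, 1.0, 2.0]          # hypothetical vector
halfway = interp_lin(1.5, [1.0, 2.0, 3.0], [10.0, 20.0, 40.0])
```

Here rank(v, 1) counts how many components are no larger than the first component, while ranked(v, 1) returns the smallest component itself.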
A link function can also be specified acting on the left hand side of a logical node, e.g.
logit(mu[i]) <- beta0 + beta1 * z1[i] + beta2 * z2[i] + b[i]
The following functions can be used on the left hand side of logical nodes as link functions: log, logit, cloglog, and probit (where probit(x) <- y is equivalent to x <- phi(y)).
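A link function on the left-hand side means the node itself is the inverse link applied to the right-hand side. The Python sketch below (illustrative only; the function names are assumptions) writes out the inverse of each of the four links, using the definitions of logit, cloglog and phi given in Table I.

```python
import math


def inv_log(eta):
    """log(mu) <- eta  is equivalent to  mu <- exp(eta)."""
    return math.exp(eta)


def inv_logit(eta):
    """logit(mu) <- eta  is equivalent to  mu <- 1 / (1 + exp(-eta))."""
    return 1.0 / (1.0 + math.exp(-eta))


def inv_cloglog(eta):
    """cloglog(mu) <- eta  is equivalent to  mu <- 1 - exp(-exp(eta))."""
    return 1.0 - math.exp(-math.exp(eta))


def inv_probit(eta):
    """probit(mu) <- eta  is equivalent to  mu <- phi(eta), the standard normal cdf."""
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))


eta = 0.3  # hypothetical value of the linear predictor
mu = inv_logit(eta)
round_trip = math.log(mu / (1.0 - mu))  # applies logit to recover eta
```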
It is important to keep in mind that logical nodes are included only for notational convenience - they cannot be given data or initial values (except when using the data transformation facility described below).
Deviance: A logical node called "deviance" is created automatically by WinBUGS: this stores -2 * log(likelihood), where 'likelihood' is the conditional probability of all data nodes given their stochastic parent nodes. This node can be monitored, and contributes to the DIC function - see DIC...
Arrays and indexing
Arrays are indexed by terms within square brackets. The four basic operators +, -, *, and / along with appropriate bracketing are allowed to calculate an integer function as an index, for example:
Y[(i + j) * k, l]
On the left-hand-side of a relation, an expression that always evaluates to a fixed value is allowed for an index, whether it is a constant or a function of data. On the right-hand-side the index can be a fixed value or a named node, which allows a straightforward formulation for mixture models in which the appropriate element of an array is 'picked' according to a random quantity (see Nested indexing and mixtures). However, functions of unobserved nodes are not permitted to appear directly as an index term (intermediate deterministic nodes may be introduced if such functions are required).
The conventions broadly follow those of S-Plus:
n:m represents n, n + 1, ..., m.
x[] represents all values of a vector x.
y[,3] indicates all values of the third column of a two-dimensional array y.
Multidimensional arrays are handled as one-dimensional arrays with a constructed index. Thus functions defined on arrays must be over equally spaced nodes within an array: for example sum(x[i, 1:4, k]).
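The 'constructed index' can be sketched in Python as follows (an illustration under stated assumptions, not WinBUGS internals). It assumes 1-based indices with the right-most index changing fastest, as described elsewhere in this section; the function name and the 3 * 4 * 2 dimensions are hypothetical.

```python
def flat_index(indices, dims):
    """1-based position in one-dimensional storage, right-most index fastest."""
    position = 0
    for index, size in zip(indices, dims):
        position = position * size + (index - 1)
    return position + 1


dims = (3, 4, 2)  # hypothetical dimensions of an array x[i, j, k]

# Positions occupied by x[2, 1:4, 1]: the middle index runs over 1..4.
positions = [flat_index((2, j, 1), dims) for j in range(1, 5)]
gaps = [b - a for a, b in zip(positions, positions[1:])]
```

The selected elements sit a constant distance apart in the underlying storage, which is the 'equally spaced' property that functions over array slices rely on.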
When dealing with unbalanced or hierarchical data a number of different approaches are possible - see Handling unbalanced datasets. The ideas discussed in Nested indexing and mixtures may also be helpful in this respect; the user should bear in mind, however, the 'contiguous elements' restriction described in Constraints on using certain distributions.
Repeated structures
Repeated structures are specified using a "for-loop". The syntax for this is:
for (i in a:b) {
list of statements to be repeated for increasing values of loop-variable i
}
Note that neither a nor b may be stochastic - see here for a possible way to get round this.
Data transformations
Although transformations of data can always be carried out before using WinBUGS, it is convenient to be able to try various transformations of dependent variables within a model description. For example, we may wish to try both y and sqrt(y) as dependent variables without creating a separate variable z = sqrt(y) in the data file.
The BUGS language therefore permits the following type of structure to occur:
for (i in 1:N) {
z[i] <- sqrt(y[i])
z[i] ~ dnorm(mu, tau)
}
Strictly speaking, this goes against the declarative structure of the model specification, with the accompanying exhortation to construct a directed graph and then to make sure that each node appears once and only once on the left-hand side of a statement. However, a check has been built in so that, when finding a logical node which also features as a stochastic node (such as z above), a stochastic node is created with the calculated values as fixed data.
We emphasise that this construction is only possible when transforming observed data (not a function of data and parameters) with no missing values.
This construction is particularly useful in Cox modelling and other circumstances where fairly complex functions of data need to be used. It is preferable for clarity to place the transformation statements in a section at the beginning of the model specification, so that the essential model description can be examined separately. See the Leuk and Endo examples.
Nested indexing and mixtures
Nested indexing can be very effective. For example, suppose N individuals can each be in one of I groups, and g[1:N] is a vector which contains the group membership. Then "group" coefficients beta[i] can be fitted using beta[g[j]] in a regression equation.
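The lookup beta[g[j]] can be mimicked in a few lines of Python (an illustrative sketch; the data values and the intercept alpha are hypothetical). Since g holds group labels counted from 1 and Python lists are indexed from 0, the label is shifted by one.

```python
g = [1, 3, 2, 3, 1]          # hypothetical group membership for N = 5 individuals
beta = [0.5, -1.0, 2.0]      # hypothetical coefficients for I = 3 groups
alpha = 10.0                 # hypothetical common intercept

# Each individual's linear predictor picks out the coefficient of its own group.
mu = [alpha + beta[g_j - 1] for g_j in g]
```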
In the BUGS language, nested indexing can be used for the parameters of distributions: for example, the Eyes example concerns a normal mixture in which the i-th case is in an unknown group $T_i$ which determines the mean $\lambda_{T_i}$ of the measurement $y_i$. Hence the model is

$$T_i \sim \text{Categorical}(P)$$
$$y_i \sim \text{Normal}(\lambda_{T_i}, \tau)$$
which may be written in the BUGS language as
for (i in 1:N) {
T[i] ~ dcat(P[])
y[i] ~ dnorm(lambda[T[i]], tau)
}
However, when using Doodles the parameters of a distribution must be a node in the graph, and so an additional stage is needed to specify the mean $\mu_i = \lambda_{T_i}$, as shown in the graph below. (We emphasise the care required in establishing convergence of these notorious models.)
![[model specification4]](model specification4.bmp)
Vector parameters can also be identified dynamically, but currently only to a maximum of two dimensions. For example, if we wanted a two-state categorical variable x to have a vector of probabilities indexed by i and j, then we could write x ~ dcat(p[i, j, 1:2]). However, suppose we require three-level indexing, for example
a ~ dcat(p.a[1:2])
b ~ dcat(p.b[1:2])
c ~ dcat(p.c[1:2])
d ~ dcat(p.d[a, b, c, 1:2])
WinBUGS will not permit this, and so the index must be explicitly calculated:
d ~ dcat(p[k, 1:2])
k <- 8 * (a - 1) + 4 * (b - 1) + c
This 'calculated index' trick is useful in many circumstances.
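The Python sketch below (illustrative, not WinBUGS code) enumerates the row index k produced by the mapping in the text for every combination of the two-state variables a, b and c. It shows that each combination gets its own row, but that the rows are not consecutive, so p would need rows up to 14. The compact alternative shown alongside is an assumption added here for comparison and is not taken from the manual.

```python
from itertools import product


def index_from_text(a, b, c):
    """Row index as written in the text above."""
    return 8 * (a - 1) + 4 * (b - 1) + c


def compact_index(a, b, c):
    """Assumed alternative that uses rows 1..8 with no gaps."""
    return 4 * (a - 1) + 2 * (b - 1) + c


states = list(product((1, 2), repeat=3))           # all (a, b, c) combinations
text_rows = [index_from_text(*s) for s in states]
compact_rows = [compact_index(*s) for s in states]
```

Either mapping works as long as every combination maps to a different row and p is dimensioned to cover the largest row used.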
Formatting of data
Data can be in S-Plus format (see most of the examples) or, for data in arrays, in rectangular format.
The whole array must be specified in the file - it is not possible just to specify selected components.
Missing values are represented as NA.
All variables in a data file must be defined in a model, even if just left unattached to the rest of the model. In Doodles such variables can be left as constants: in a model description they can be assigned vague priors or allocated to dummy variables.
S-Plus format: This allows scalars and arrays to be named and given values in a single structure headed by the keyword list. There must be no space after list.
For example, in the Rats example, we need to specify a scalar xbar, dimensions N and T, a vector x and a two-dimensional array Y with 30 rows and 5 columns. This is achieved using the following format:
list(
xbar = 22, N = 30, T = 5,
x = c(8.0, 15.0, 22.0, 29.0, 36.0),
Y = structure(
.Data = c(
151, 199, 246, 283, 320,
145, 199, 249, 293, 354,
...................
...................
137, 180, 219, 258, 291,
153, 200, 244, 286, 324),
.Dim = c(30, 5)
)
)
See the examples for other use of this format.
WinBUGS reads data into an array by filling the right-most index first, whereas the S-Plus program fills the left-most index first. Hence WinBUGS reads the string of numbers c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10) into a 2 * 5 dimensional matrix in the order
[i, j]th element of matrix value
[1, 1] 1
[1, 2] 2
[1, 3] 3
..... ..
[1, 5] 5
[2, 1] 6
..... ..
[2, 5] 10
whereas S-Plus reads the same string of numbers in the order
[i, j]th element of matrix value
[1, 1] 1
[2, 1] 2
[1, 2] 3
..... ..
[1, 3] 5
[2, 3] 6
..... ..
[2, 5] 10
Hence the ordering of the array dimensions must be reversed before using the S-Plus dput command to create a data file for input into WinBUGS.
For example, consider the 2 * 5 dimensional matrix
1 2 3 4 5
6 7 8 9 10
This must be stored in S-Plus as a 5 * 2 dimensional matrix:
> M
[,1] [,2]
[1,] 1 6
[2,] 2 7
[3,] 3 8
[4,] 4 9
[5,] 5 10
The S-Plus command
> dput(list(M=M), file="matrix.dat")
will then produce the following data file
list(M = structure(.Data = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), .Dim=c(5,2)))
Edit the .Dim statement in this file from .Dim=c(5,2) to .Dim=c(2,5). The file is now in the correct format to input the required 2 * 5 dimensional matrix into WinBUGS.
Now consider a 3 * 2 * 4 dimensional array
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
17 18 19 20
21 22 23 24
This must be stored in S-Plus as the 4 * 2 * 3 dimensional array:
> A
, , 1
[,1] [,2]
[1,] 1 5
[2,] 2 6
[3,] 3 7
[4,] 4 8
, , 2
[,1] [,2]
[1,] 9 13
[2,] 10 14
[3,] 11 15
[4,] 12 16
, , 3
[,1] [,2]
[1,] 17 21
[2,] 18 22
[3,] 19 23
[4,] 20 24
The command
> dput(list(A=A), file="array.dat")
will then produce the following data file
list(A = structure(.Data = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24), .Dim=c(4,2,3)))
Edit the .Dim statement in this file from .Dim=c(4,2,3) to .Dim=c(3,2,4). The file is now in the correct format to input the required 3 * 2 * 4 dimensional array into WinBUGS in the order
[i, j, k]th element of matrix value
[1, 1, 1] 1
[1, 1, 2] 2
..... ..
[1, 1, 4] 4
[1, 2, 1] 5
[1, 2, 2] 6
..... ..
[2, 1, 3] 11
[2, 1, 4] 12
[2, 2, 1] 13
[2, 2, 2] 14
..... ..
[3, 2, 3] 23
[3, 2, 4] 24
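The two fill orders, and the reason for reversing the dimensions, can be checked with a small Python sketch (an illustration only; the function names are assumptions). It reproduces the 2 * 5 example above: the values are filled in both orders, and the 5 * 2 matrix stored on the S-Plus side is shown to be the transpose of the matrix WinBUGS reads.

```python
def fill_right_most_first(values, nrow, ncol):
    """Fill a matrix with the right-most index changing fastest (WinBUGS order)."""
    return [[values[i * ncol + j] for j in range(ncol)] for i in range(nrow)]


def fill_left_most_first(values, nrow, ncol):
    """Fill a matrix with the left-most index changing fastest (S-Plus order)."""
    return [[values[j * nrow + i] for j in range(ncol)] for i in range(nrow)]


values = list(range(1, 11))

winbugs_matrix = fill_right_most_first(values, 2, 5)
splus_matrix = fill_left_most_first(values, 2, 5)

# The 5 * 2 matrix M stored in S-Plus, whose dput output is 1, 2, ..., 10.
stored = fill_left_most_first(values, 5, 2)
transposed = [list(row) for row in zip(*stored)]
```

Because transposing the stored matrix gives back the matrix WinBUGS reads, writing the data out from the reversed-dimension array and then editing .Dim yields the intended layout.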
Rectangular format:
The columns for data in rectangular format need to be headed by the array name. The arrays need to be of equal size, and the array names must have explicit brackets, for example:
age[] sex[]
26 0
52 1
.....
34 0
END
Note that the file must end with an 'END' keyword, as shown above and below, and this must be followed by at least one blank line.
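As a rough illustration of the layout rules above, the Python sketch below reads a rectangular block into named columns. It is a hypothetical helper, not how WinBUGS itself parses data: it checks the header, the equal column lengths and the closing END keyword, treats NA as missing, and does not enforce the trailing blank line. The sample uses only the three rows written out in the example above.

```python
def parse_rectangular(text):
    """Parse a rectangular-format block into {array name: list of values}."""
    lines = [line.split() for line in text.splitlines() if line.strip()]
    header, rows = lines[0], lines[1:]
    if not rows or rows[-1] != ["END"]:
        raise ValueError("rectangular data must finish with END")
    columns = {name: [] for name in header}
    for row in rows[:-1]:
        if len(row) != len(header):
            raise ValueError("every row needs one value per column")
        for name, token in zip(header, row):
            # Missing values are written as NA in the data file.
            columns[name].append(None if token == "NA" else float(token))
    return columns


example = """age[] sex[]
26 0
52 1
34 0
END

"""
data = parse_rectangular(example)
```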
Multi-dimensional arrays can be specified by explicit indexing: for example, the Ratsy file begins
Y[,1] Y[,2] Y[,3] Y[,4] Y[,5]
151 199 246 283 320
145 199 249 293 354
147 214 263 312 328
.......
153 200 244 286 324
END
The first index position for any array must always be empty.
It is possible to load a mixture of rectangular and S-Plus format data files for the same model. For example, if data arrays are provided in a rectangular file, constants can be defined in a separate list statement (see also the Rats example with data files Ratsx and Ratsy).
(See Handling unbalanced datasets for details of how to handle unbalanced data.)
Note that programs exist for conversion of data from other packages: please see the BUGS resources web-page at
http://www.mrc-bsu.cam.ac.uk/bugs/weblinks/webresource.shtml