---
pagetitle: Embedded Laplace Approximation
---

# Embedded Laplace Approximation

The embedded Laplace approximation can be used to approximate certain marginal
and conditional distributions that arise in latent Gaussian models.
The embedded Laplace approximation replaces explicit sampling of high-dimensional Gaussian latent
variables with a local Gaussian approximation.
In doing so, it marginalizes out the latent Gaussian variables.
Inference can then be performed on the remaining, often low-dimensional,
parameters. The embedded Laplace approximation in Stan is best suited for
latent Gaussian models when jointly sampling over all model parameters is
expensive and the conditional posterior of the Gaussian latent variables is
reasonably close to Gaussian.

For observed data $y$, latent Gaussian variables $\theta$, and hyperparameters $\phi$, a latent Gaussian model has the following hierarchical structure:
\begin{eqnarray}
\phi &\sim& p(\phi), \\
\theta &\sim& \text{MultiNormal}(0, K(\phi)), \\
y &\sim& p(y \mid \theta, \phi).
\end{eqnarray}
In this formulation, $p(y \mid \theta, \phi)$ is the likelihood function that
specifies how observations are generated conditional on $\theta$ and $\phi$.
$K(\phi)$ denotes the prior covariance matrix for the latent Gaussian variables
$\theta$ and is parameterized by $\phi$. The prior on $\theta$ is centered at 0;
however, an offset can always be added when specifying the likelihood function
$p(y \mid \theta, \phi)$.

To sample from the joint posterior $p(\phi, \theta \mid y)$, we can either
use a standard method, such as Markov chain Monte Carlo, or we can follow
are typically available in closed form and so they must be approximated.
The marginal posterior can be written as $p(\phi \mid y) \propto p(y \mid \phi) p(\phi)$,
where $p(y \mid \phi) = \int p(y \mid \phi, \theta) p(\theta) \text{d}\theta$
is called the marginal likelihood. The Laplace method approximates
$p(y \mid \phi, \theta) p(\theta)$ with a normal distribution centered at the mode,
$$
\theta^* = \underset{\theta}{\text{argmax}} \ \log p(\theta \mid y, \phi),
$$
using one of Stan's algorithms. The marginal posterior is lower
dimensional and likely to have a simpler geometry leading to more
efficient inference. On the other hand, each marginal likelihood
computation is more costly, and the combined change in efficiency
depends on the application.

To obtain posterior draws for $\theta$, we sample from the normal
approximation to $p(\theta \mid y, \phi)$ in `generated quantities`.
then $p(\theta \mid y, \phi)$ produces samples from the joint posterior
$p(\theta, \phi \mid y)$.

The Laplace approximation is especially useful if $p(y \mid \phi, \theta)$ is
log-concave, e.g., Poisson, binomial, negative-binomial, and Bernoulli.
(The normal distribution is also log-concave; however, when the likelihood is
normal, marginalization can be performed exactly and does not require an
approximation.)
Stan's embedded Laplace approximation is restricted to the case
where the prior $p(\theta \mid \phi)$ is multivariate normal.
Furthermore, the likelihood $p(y \mid \phi, \theta)$ must be computed using
only operations which support higher-order derivatives
In the `model` block, we increment `target` with `laplace_marginal`, a function
that approximates the log marginal likelihood $\log p(y \mid \phi)$.
The signature of the function is:

\index{{\tt \bfseries laplace\_marginal }!{\tt (function likelihood\_function, tuple(...) likelihood\_arguments, int hessian\_block\_size, function covariance\_function, tuple(...) covariance\_arguments): real}|hyperpage}

<!-- real; laplace_marginal; (function likelihood_function, tuple(...) likelihood_arguments, int hessian_block_size, function covariance_function, tuple(...) covariance_arguments); -->
`real` **`laplace_marginal`**`(function likelihood_function, tuple(...) likelihood_arguments, int hessian_block_size, function covariance_function, tuple(...) covariance_arguments)`


which returns an approximation to the log marginal likelihood $p(y \mid \phi)$.
{{< since 2.37 >}}

The embedded Laplace functions accept two functors whose user-defined arguments are passed to `laplace_marginal` as tuples.

1. `likelihood_function` - the user-specified log likelihood, whose first argument is the vector of latent Gaussian variables $\theta$.
The subsequent arguments are user defined.
   - `real likelihood_function(vector theta, likelihood_argument_1, likelihood_argument_2, ...)`
2. `likelihood_arguments` - a tuple of arguments whose internal members are passed to the log likelihood function.
This tuple does NOT include the latent variable $\theta$.
3. `hessian_block_size` - the block size of the Hessian of the log likelihood, $\partial^2 \log p(y \mid \theta, \phi) / \partial \theta^2$.
4. `covariance_function` - a function that returns the covariance matrix of the multivariate normal prior on $\theta$.
   - `matrix covariance_function(covariance_argument_1, covariance_argument_2, ...)`
5. `covariance_arguments` - a tuple of arguments whose internal members are passed to the covariance function.

Below we go over each argument in more detail.
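Before the detailed walkthrough, a minimal sketch may help show how these pieces fit together. All function and variable names below are hypothetical, and the Poisson likelihood and squared-exponential covariance are illustrative choices, not requirements:

```stan
functions {
  // Hypothetical log likelihood: y[i] depends only on theta[i]
  real poisson_log_lik(vector theta, array[] int y, vector log_offset) {
    return poisson_log_lpmf(y | theta + log_offset);
  }
  // Hypothetical prior covariance: squared-exponential kernel over inputs x
  matrix se_cov(array[] real x, real alpha, real rho) {
    return gp_exp_quad_cov(x, alpha, rho);
  }
}
model {
  alpha ~ normal(0, 1);
  rho ~ inv_gamma(5, 5);
  // hessian_block_size = 1 because, for this likelihood, the Hessian
  // of the log likelihood with respect to theta is diagonal
  target += laplace_marginal(poisson_log_lik, (y, log_offset), 1,
                             se_cov, (x, alpha, rho));
}
```

Note that the tuples `(y, log_offset)` and `(x, alpha, rho)` supply the trailing arguments of the two functors; $\theta$ itself is never passed explicitly.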

## Specifying the log likelihood function {#laplace-likelihood_spec}

The first step to use the embedded Laplace approximation is to write down a
function in the `functions` block which returns the log likelihood
$\log p(y \mid \theta, \phi)$.

There are a few constraints on this function:

1. The function return type must be `real`.

2. The first argument must be the latent Gaussian variable $\theta$ and must
have type `vector`.
as data or parameter.

The tuple after `likelihood_function` contains the arguments that get passed
to `likelihood_function` *excluding $\theta$*. For instance, if a user defined
likelihood uses a real and a matrix, the likelihood function's signature would
first have a vector and then a real and matrix argument.

```stan
real likelihood_function(vector theta, real a, matrix X)
```

Data-only arguments can be declared with the `data` qualifier,
for example,

```stan
real likelihood_function(vector theta, data vector x, ...)
```

In addition to the likelihood function, users must specify the block size
of the Hessian, $\partial^2 \log p(y \mid \theta, \phi) / \partial \theta^2$.
The Hessian is often block diagonal, and this structure can be exploited for fast computation.
For example, if $y_i$ only depends on $\theta_i$, then the Hessian is diagonal and `hessian_block_size=1`.
On the other hand, if the Hessian is not block diagonal, we can always set
`hessian_block_size=n` where $n$ is the size of $\theta$.
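As an illustration (with hypothetical names), a Bernoulli-logit likelihood in which each $y_i$ depends only on $\theta_i$ has a diagonal Hessian, so `hessian_block_size=1` is appropriate:

```stan
functions {
  // Each y[i] depends only on theta[i], so the Hessian
  // of the log likelihood with respect to theta is diagonal:
  // pass hessian_block_size = 1 to the Laplace functions.
  real bernoulli_logit_lik(vector theta, array[] int y, vector x_beta) {
    return bernoulli_logit_lpmf(y | theta + x_beta);
  }
}
```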

## Specifying the covariance function

The argument `covariance_function` returns the prior covariance matrix
Its return type must be a matrix of size $n \times n$, where $n$ is the size of $\theta$:

```stan
matrix covariance_function(...)
```
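For instance, a squared-exponential covariance function might be written as follows; the function name is hypothetical and the diagonal jitter is an assumption added for numerical stability, not a requirement:

```stan
functions {
  matrix se_cov(array[] real x, real alpha, real rho) {
    int n = size(x);
    // Built-in squared-exponential (exponentiated quadratic) kernel
    matrix[n, n] K = gp_exp_quad_cov(x, alpha, rho);
    // Small jitter on the diagonal for numerical stability
    for (i in 1:n) {
      K[i, i] += 1e-8;
    }
    return K;
  }
}
```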


The `...` represents a set of optional
variadic arguments. There are no type restrictions on the variadic arguments
`...`, and each argument can be passed as data or parameter. The variables
It is also possible to specify control parameters, which can help improve the
optimization that underlies the Laplace approximation, using `laplace_marginal_tol`
with the following signature:


<!-- real; laplace_marginal_tol; (function likelihood_function, tuple(...) likelihood_arguments, int hessian_block_size, function covariance_function, tuple(...) covariance_arguments, tuple(...) laplace_ops); -->
\index{{\tt \bfseries laplace\_marginal\_tol }!{\tt (function likelihood\_function, tuple(...) likelihood\_arguments, int hessian\_block\_size, function covariance\_function, tuple(...) covariance\_arguments, tuple(...) laplace\_ops): real}|hyperpage}

```stan
real laplace_marginal_tol(function likelihood_function, tuple(...) likelihood_arguments,
                          int hessian_block_size,
                          function covariance_function, tuple(...) covariance_arguments,
                          tuple(vector theta_init, real tol, int max_steps, int solver,
                                int max_steps_linesearch, int allow_fallback) laplace_ops)
```

Returns an approximation to the log marginal likelihood $p(y \mid \phi)$
and allows the user to tune the control parameters of the approximation.

* `theta_init`: the initial guess for a Newton solver when finding the mode
of $p(\theta \mid y, \phi)$. By default, it is a zero-vector.

* `tol`: the tolerance $\epsilon$ of the optimizer. Specifically, the optimizer
stops when $||\nabla \log p(\theta \mid y, \phi)|| \le \epsilon$. By default,
the value is $\epsilon \approx 1.49 \times 10^{-8}$, the square root of machine precision.

* `max_num_steps`: the maximum number of steps taken by the optimizer before
it gives up (in which case the Metropolis proposal gets rejected). The default
is 500 steps.

* `hessian_block_size`: the size of the blocks, assuming the Hessian
$\partial^2 \log p(y \mid \theta, \phi) / \partial \theta^2$ is block-diagonal.
The structure of the Hessian is determined by the dependence structure of $y$
on $\theta$. By default, the Hessian is treated as diagonal
(`hessian_block_size=1`). If the Hessian is not block diagonal, then set
`hessian_block_size=n`, where `n` is the size of $\theta$.

* `solver`: choice of Newton solver. The optimizer underlying the
Laplace approximation does one of three matrix decompositions to compute a
Newton step. The problem determines which decomposition is numerically stable.
By default (`solver=1`), the solver attempts a Cholesky decomposition of the
negative Hessian of the log likelihood,
$- \partial^2 \log p(y \mid \theta, \phi) / \partial \theta^2$.
This operation is legal if the negative Hessian is positive-definite,
which will always be true when the likelihood is log concave.
If `solver=2`, the solver makes a Cholesky decomposition of the covariance matrix $K(\phi)$.
Since a covariance matrix is always positive-definite, computing its
Cholesky decomposition is always a legal operation, at least in theory.
In practice, we may not be able to compute the Cholesky decomposition of the
negative Hessian or the covariance matrix, either because it does not exist or
because of numerical issues.
In that case, we can use `solver=3` which uses a more expensive but less
specialized approach to compute a Newton step.

* `max_steps_linesearch`: maximum number of steps in linesearch. The linesearch
method adjusts the step size to ensure that a Newton step leads to an increase in
the objective function (i.e., $f(\theta) = p(\theta \mid \phi, y)$).
If a standard Newton step does not improve the objective function,
the step is adjusted iteratively until the objective function increases
or the maximum number of steps in the linesearch is reached.
By default, `max_steps_linesearch=1000`.
Setting `max_steps_linesearch=0` results in no linesearch.

* `allow_fallback`: if the user-specified solver fails, this flag determines whether to fall back to the next solver. For example, if the user specifies `solver=1` but the Cholesky decomposition of the negative Hessian $- \partial^2 \log p(y \mid \theta, \phi) / \partial \theta^2$ fails, the optimizer will try `solver=2` instead.
By default, `allow_fallback = 1` (TRUE).

The embedded Laplace approximation's options have a helper function, `generate_laplace_options(int theta_size)`, which generates the options tuple with default values. This can be useful for setting up the control parameters once in the `transformed data` block and reusing them within the model.

```stan
tuple(vector[theta_size], real, int, int, int, int, int) laplace_ops = generate_laplace_options(theta_size);
// Modify solver type
laplace_ops.5 = 2;
// Turn off solver fallback
laplace_ops.7 = 0;
```

The arguments stored in the `laplace_ops` tuple are:
```
laplace_ops = {theta_init,
tol,
max_num_steps,
hessian_block_size,
solver,
max_steps_linesearch,
allow_fallback}
```
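Putting this together, here is a sketch of how `generate_laplace_options` and `laplace_marginal_tol` might be used, assuming hypothetical user-defined functions `poisson_log_lik` and `se_cov` in the `functions` block and data `y`, `log_offset`, and `x` of size `N`:

```stan
transformed data {
  tuple(vector[N], real, int, int, int, int, int) laplace_ops
      = generate_laplace_options(N);
  laplace_ops.2 = 1e-7;  // tighter optimizer tolerance
  laplace_ops.5 = 2;     // Cholesky-decompose K(phi) rather than the negative Hessian
}
model {
  target += laplace_marginal_tol(poisson_log_lik, (y, log_offset), 1,
                                 se_cov, (x, alpha, rho), laplace_ops);
}
```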

{{< since 2.37 >}}

Expand All @@ -253,18 +285,18 @@ approximation of $p(\theta \mid \phi, y)$ using `laplace_latent_rng`.
The signature for `laplace_latent_rng` follows closely
the signature for `laplace_marginal`:

<!-- vector; laplace_latent_rng; (function likelihood_function, tuple(...), int hessian_block_size, function covariance_function, tuple(...)); -->
\index{{\tt \bfseries laplace\_latent\_rng }!{\tt (function likelihood\_function, tuple(...), int hessian\_block\_size, function covariance\_function, tuple(...)): vector}|hyperpage}

`vector` **`laplace_latent_rng`**`(function likelihood_function, tuple(...), int hessian_block_size, function covariance_function, tuple(...))`<br>\newline

Draws samples from the Laplace approximation to the conditional posterior $p(\theta \mid y, \phi)$.
{{< since 2.37 >}}
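For example, given a hypothetical user-defined log likelihood `poisson_log_lik` and covariance function `se_cov`, approximate posterior draws of $\theta$ can be generated as:

```stan
generated quantities {
  // One approximate draw from p(theta | y, phi) per posterior draw of phi
  vector[N] theta = laplace_latent_rng(poisson_log_lik, (y, log_offset), 1,
                                       se_cov, (x, alpha, rho));
}
```

Because the arguments mirror those of `laplace_marginal`, the same functors and tuples can be reused in both the `model` and `generated quantities` blocks.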

Once again, it is possible to specify control parameters:
\index{{\tt \bfseries laplace\_latent\_tol\_rng }!{\tt (function likelihood\_function, tuple(...), int hessian\_block\_size, function covariance\_function, tuple(...), tuple(...) laplace\_ops): vector}|hyperpage}

`vector` **`laplace_latent_tol_rng`**`(function likelihood_function, tuple(...), int hessian_block_size, function covariance_function, tuple(...), tuple(...) laplace_ops)`<br>\newline
Draws approximate samples from the conditional posterior $p(\theta \mid y, \phi)$
and allows the user to tune the control parameters of the approximation.
{{< since 2.37 >}}