October 27, 2016

## Variational Inference: Overview

• Notation: Let $$\mathbf{x}=(x_1, \ldots, x_n)'$$ be the observed data and $$\mathbf{Z}=(Z_1, \ldots, Z_m)'$$ be unobserved variables (e.g., unobserved cluster membership indicators, model parameters, etc.). If there are other hyperparameters $$\alpha$$, we assume they are fixed for now.

• Problem: Calculate the posterior distribution $p(\mathbf{Z}\mid \mathbf{x},\alpha) = \frac{p(\mathbf{Z}, \mathbf{x}\mid \alpha)}{\int_{\mathbf{Z}}p(\mathbf{Z}, \mathbf{x}\mid \alpha)\mathrm{d}\mathbf{Z}},$ which is hard to compute for complex likelihoods and priors

• One Solution: Approximate the posterior by a simpler distribution, namely the member of a computationally feasible family of distributions that is closest to the actual posterior. (How do we pick such a family? Given the family, how do we find the "closest" member?)

## Example: Gaussian Mixture

• Generative Model:
1. $$\mu_k\sim N(0,\tau^2)$$, $$k=1, \ldots, K$$.
2. $$Z_i\sim {\sf Categorical}(\mathbf{\pi})$$, $$i=1,\ldots, n$$, $$\mathbf{\pi}=(\pi_1, \ldots, \pi_K)'$$, $$\pi_k\geq 0$$, $$\sum_k{\pi_k}=1$$
3. $$x_i\sim N(\mu_{Z_i}, \sigma^2)$$, $$i=1,\ldots, n$$.
• Given $$\tau$$, $$\sigma$$, and the data, the joint posterior of the Gaussian means ($$\mu_k$$) and indicators ($$Z_i$$) is $p(\mathbf{\mu},\mathbf{Z}\mid \mathbf{x})=\frac{\prod_{k}p(\mu_k)\prod_{i}p(Z_i)p(x_i\mid Z_i, \mathbf{\mu})}{p(\mathbf{x})}$

• Denominator (marginal distribution of the observed data) is a sum of $$K^n$$ terms: $$p(\mathbf{x})=\sum_{\mathbf{Z}}\int_{\mathbf{\mu}}\prod_{k}p(\mu_k)\prod_{i}p(Z_i)p(x_i\mid Z_i, \mathbf{\mu})\,\mathrm{d}\mathbf{\mu}$$

• Use $$q$$ with fitted variational parameters $$\hat{\mathbf{\nu}}$$ as a proxy for the posterior: predictions, parameter inference, etc. (Caution: variational approximations can understate posterior standard deviations.)
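
As a concrete illustration, here is a minimal simulation from this generative model (NumPy; the values of $$K$$, $$n$$, $$\tau$$, $$\sigma$$, and $$\mathbf{\pi}$$ below are hypothetical choices, not from these notes). Even for these modest sizes the exact marginal $$p(\mathbf{x})$$ sums over $$K^n$$ assignments:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical settings: K components, n observations
K, n = 3, 200
tau, sigma = 5.0, 1.0
pi = np.ones(K) / K                      # uniform mixing weights

mu = rng.normal(0.0, tau, size=K)        # 1. mu_k ~ N(0, tau^2)
Z = rng.choice(K, size=n, p=pi)          # 2. Z_i ~ Categorical(pi)
x = rng.normal(mu[Z], sigma)             # 3. x_i ~ N(mu_{Z_i}, sigma^2)

# The exact marginal p(x) is a sum over K**n cluster assignments -- intractable:
print(f"terms in p(x): {K}**{n} (about 10^{n * np.log10(K):.0f})")
```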

## Main idea

• Pick a family of distributions over the unobserved (latent) variables $$\mathbf{Z}$$, indexed by variational parameters $$\mathbf{\nu}$$: $q(Z_1, Z_2, \ldots, Z_m\mid \mathbf{\nu})$
• Find the value(s) of $$\mathbf{\nu}$$ for which $$q$$ best approximates the posterior of interest
• Remark: we are approximating a distribution given the data $$\mathbf{x}$$ at hand: $$p(\mathbf{Z}\mid \mathbf{X}=\mathbf{x})$$, not all the conditional distributions $$\{p(\mathbf{Z}\mid \mathbf{X}=\mathbf{x})\}_{\mathbf{x}\in \mathcal{X}}$$

## Kullback-Leibler Divergence: Measure the closeness of two distributions

• Definition $KL(q\|p)=E_q\left[\log\frac{q(\mathbf{Z})}{p(\mathbf{Z}\mid \mathbf{x})}\right]$

• Nonnegative; equals zero if and only if $$q(\mathbf{z})=p(\mathbf{z}\mid\mathbf{x})$$
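
As a quick numerical check of these properties, here is a minimal sketch for two discrete distributions (the probability vectors are hypothetical, chosen only for illustration):

```python
import numpy as np

def kl(q, p):
    """KL(q || p) = E_q[log q(Z) - log p(Z)] for discrete distributions."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    return float(np.sum(q * (np.log(q) - np.log(p))))

q = np.array([0.5, 0.3, 0.2])   # hypothetical approximating distribution
p = np.array([0.4, 0.4, 0.2])   # hypothetical target distribution

print(kl(q, p))   # positive, since q differs from p
print(kl(q, q))   # 0.0: the divergence vanishes only when the two agree
```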

## Evidence Lower Bound for the Log Probability of the Observed Data

\begin{aligned} \log p(\mathbf{x}) & = \log \int_{\mathbf{Z}}p(\mathbf{x},\mathbf{Z}) \mathrm{d} \mathbf{Z} = \log \int_{\mathbf{Z}}p(\mathbf{x},\mathbf{Z})\frac{q(\mathbf{Z})}{q(\mathbf{Z})} \mathrm{d} \mathbf{Z} = \log \left(E_q\left[\frac{p(\mathbf{x},\mathbf{Z})}{q(\mathbf{Z})}\right]\right)\\ & \overset{\text{Jensen}}{\geq} E_q[\log p(\mathbf{x},\mathbf{Z})]-E_q[\log q(\mathbf{Z})]\overset{\Delta}{=} ELB \end{aligned}
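
The bound can be checked numerically by Monte Carlo: draw from $$q$$ and average $$\log p(\mathbf{x},\mathbf{Z})-\log q(\mathbf{Z})$$. A minimal sketch for a toy conjugate model (the model, the observed value, and the variational distribution below are hypothetical choices, picked so that $$\log p(\mathbf{x})$$ is available in closed form):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Hypothetical toy model: Z ~ N(0, 1), x | Z ~ N(Z, 1), observed x = 1.0.
x_obs = 1.0
# Hypothetical variational distribution q(Z) = N(0.4, 0.6^2).
q_mean, q_sd = 0.4, 0.6

# Monte Carlo estimate of ELB = E_q[log p(x, Z)] - E_q[log q(Z)]
z = rng.normal(q_mean, q_sd, size=100_000)
log_joint = norm.logpdf(z, loc=0.0, scale=1.0) + norm.logpdf(x_obs, loc=z, scale=1.0)
log_q = norm.logpdf(z, loc=q_mean, scale=q_sd)
elb = np.mean(log_joint - log_q)

# For this model the marginal is x ~ N(0, 2), so the bound can be verified directly.
print(f"ELB = {elb:.4f}  <=  log p(x) = {norm.logpdf(x_obs, loc=0.0, scale=np.sqrt(2)):.4f}")
```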

## KL Divergence between the Truth and the Approximating Distribution

\begin{aligned} KL(q(\mathbf{Z})\|p(\mathbf{Z}\mid\mathbf{x})) & = E_q\left[\log \frac{q(\mathbf{Z})}{p(\mathbf{Z}\mid\mathbf{x})}\right]\\ & = E_q[\log q(\mathbf{Z})]-E_q[\log p(\mathbf{Z},\mathbf{x})]+\log p(\mathbf{x})\\ & = -\left\{E_q[\log p(\mathbf{x},\mathbf{Z})]-E_q[\log q(\mathbf{Z})]\right\}+\log p(\mathbf{x})\\ & = -ELB+\text{a term independent of}~q \end{aligned}

• Key: Minimizing the KL divergence over $$q$$ amounts to maximizing ELB over $$q$$, which is a lower bound on the log probability of the observed data.

## Mean Field Variational Inference (Choosing the family of $$q$$)

• Assume $$q(Z_1, \ldots, Z_m)=\prod_{j=1}^mq(Z_j)$$; Independence model.
• $$\mathbf{Z}$$ can be grouped, relaxing the assumption to a group-level factorization
• Typically, this family does not include the true posterior, which usually exhibits dependence among the components of $$\mathbf{Z}$$
• Gaussian mixture: the $$Z_i$$ are dependent on each other and on $$\mathbf{\mu}$$ given the data (show the DAG)
• Dependence makes things hard; the mean-field approximation simplifies computation via a component-wise updating scheme, which is suitable for parallelization
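
• For the Gaussian mixture above, a standard mean-field choice (a sketch; the parametric forms below are the usual ones, introduced here for illustration) is $q(\mathbf{\mu},\mathbf{Z}\mid\mathbf{\nu})=\prod_{k=1}^K q(\mu_k\mid m_k, s_k^2)\prod_{i=1}^n q(Z_i\mid \mathbf{\varphi}_i),\quad q(\mu_k)=N(m_k,s_k^2),\quad q(Z_i)={\sf Categorical}(\mathbf{\varphi}_i),$ with variational parameters $$\mathbf{\nu}=\{m_k, s_k^2, \mathbf{\varphi}_i\}$$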

## Optimizing ELB($$q$$), or ELB($$\mathbf{\nu}$$) by Coordinate Ascent

• Chain rule: $$p(\mathbf{Z},\mathbf{x})=p(\mathbf{x})\prod_{j=1}^mp(Z_j\mid Z_{1:{j-1}}, \mathbf{x})$$
• (Negative) entropy of the variational distribution $$q$$: $$E_q[\log q(\mathbf{Z})]=\sum_{j=1}^mE_{q(Z_j)}[\log q(Z_j)]$$; we will use $$E_j$$ to denote the expectation taken with respect to the marginal distribution of $$Z_j$$ implied by $$q(\mathbf{Z})$$.
• ELB($$q$$) decomposes as $ELB(q)=\log p(\mathbf{x})+\sum_j \left\{E_q[\log p(Z_j\mid Z_{1:{j-1}}, \mathbf{x})]-E_j[\log q(Z_j)]\right\}$

## Coordinate Ascent

• Recipe: optimize over $$q(Z_j)$$, one at a time.
• For coordinate $$k$$: use $$Z_k$$ as the last variable in the chain rule (why? So that only one term of the chain rule depends on $$Z_k$$)
• ELB viewed as a function of $$q(Z_k)$$: $ELB_k=E_q[\log p(Z_k\mid Z_{-k},\mathbf{x})]-E_k[\log q(Z_k)]+const$
• Expand: $ELB_k=\int_{Z_k} q(Z_k) E_{-k}[\log p(Z_k\mid Z_{-k},\mathbf{x})] \mathrm{d} Z_k-\int_{Z_k}q(Z_k)\log q(Z_k) \mathrm{d} Z_k+const$
• Equilibrium: $\frac{\mathrm{d}\, ELB_k}{\mathrm{d}\, q(Z_k)} = E_{-k}[\log p(Z_k\mid Z_{-k},\mathbf{x})]-\log q(Z_k)-1=0$

## Coordinate Ascent (continued)

• Equilibrium condition plus a Lagrange multiplier enforcing that the maximizer integrates to one: we obtain the coordinate ascent update for $$q(Z_k)$$ $q^*(z_k) \propto \exp\left\{E_{-k} \log p(z_k\mid Z_{-k}, \mathbf{x})\right\},$ which is equivalent to $$q^*(z_k) \propto\exp\left\{E_{-k} \log p(z_k, Z_{-k}, \mathbf{x})\right\}$$ because $$p(Z_{-k},\mathbf{x})$$ does not depend on $$z_k$$
• Iterate over $$k=1,\ldots, m$$; at convergence we obtain $$q^*$$, a local maximizer of ELB($$q$$)
• Use $$q^*$$ as the proxy for the true posterior
• Connection to Gibbs sampling:
• GS: sample from full conditionals $$p(Z_k\mid Z_{-k},\mathbf{x})$$; iterate
• Coordinate Ascent: $$q^*(z_k)\propto \exp(E_{-k}[\log p(Z_k\mid Z_{-k}, \mathbf{x})])$$
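
Putting the updates together for the Gaussian mixture example (known $$\tau$$, $$\sigma$$, uniform $$\mathbf{\pi}$$), a minimal CAVI sketch follows; the update formulas come from $$q^*(z_k)\propto\exp\{E_{-k}\log p(z_k, Z_{-k},\mathbf{x})\}$$ with $$q(\mu_k)$$ Gaussian and $$q(Z_i)$$ categorical, while the function name and default values are illustrative:

```python
import numpy as np

def cavi_gmm(x, K, tau=5.0, sigma=1.0, n_iter=100, seed=0):
    """Mean-field CAVI sketch for: mu_k ~ N(0, tau^2), Z_i ~ Categorical(uniform),
    x_i ~ N(mu_{Z_i}, sigma^2), with q(mu_k) = N(m_k, s_k^2), q(Z_i) = Categorical(phi_i)."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    m = rng.normal(0.0, tau, size=K)              # variational means, random init
    s2 = np.full(K, tau**2)                       # variational variances
    for _ in range(n_iter):
        # q(Z_i) update: phi_{ik} proportional to exp{ E_q[log p(x_i | Z_i = k, mu_k)] }
        logits = (np.outer(x, m) - 0.5 * (m**2 + s2)) / sigma**2
        logits -= logits.max(axis=1, keepdims=True)   # stabilize before exponentiating
        phi = np.exp(logits)
        phi /= phi.sum(axis=1, keepdims=True)
        # q(mu_k) update: N(0, tau^2) prior combined with responsibility-weighted data
        s2 = 1.0 / (1.0 / tau**2 + phi.sum(axis=0) / sigma**2)
        m = s2 * (phi.T @ x) / sigma**2
    return m, s2, phi
```

Run on data simulated as in the earlier sketch, the returned $$m$$ typically recovers the component means up to label permutation; coordinate ascent only guarantees a local maximizer of ELB($$q$$).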

## Remarks

• Why this works: in the Expand step, we used the property that $$q$$ factorizes across dimensions.
• The updating formula of $$q^*$$: a result of the factorization
• The optimal $$q^*$$ might not be easy to work with, but it is for many models. (Conditional multinomial example, worked out below.)
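
• For instance, in the Gaussian mixture the optimal factor for an indicator is categorical (a sketch, using the mean-field forms $$q(\mu_k)=N(m_k,s_k^2)$$ introduced earlier): $q^*(z_i=k)\propto \pi_k\exp\left\{E_{q(\mu_k)}\left[-\frac{(x_i-\mu_k)^2}{2\sigma^2}\right]\right\}\propto \pi_k\exp\left\{\frac{x_i m_k-\frac{1}{2}(m_k^2+s_k^2)}{\sigma^2}\right\},$ which is easy to normalize over $$k=1,\ldots,K$$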

## Exponential Family (EF) Conditionals

• Suppose each conditional is in the exponential family: $p(z_j\mid z_{-j},\mathbf{x})=h(z_j)\exp\{\eta(z_{-j},\mathbf{x})'t(z_j)-a(\eta(z_{-j},\mathbf{x}))\}$
• Rich applications:
• Kalman filters
• Hierarchical Hidden Markov Models (HMMs)
• Mixed-membership model of exponential families
• Bayesian linear regression
• Bayesian mixtures of exponential families with conjugate priors
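
• Concrete instance (a sketch for the Gaussian mixture above): the full conditional of a component mean is Gaussian and hence of this form, $p(\mu_k\mid \mathbf{z},\mathbf{x})\propto \exp\left\{\frac{\sum_i 1\{z_i=k\}x_i}{\sigma^2}\,\mu_k-\left(\frac{1}{2\tau^2}+\frac{\sum_i 1\{z_i=k\}}{2\sigma^2}\right)\mu_k^2\right\},$ i.e., sufficient statistic $$t(\mu_k)=(\mu_k, \mu_k^2)'$$ and natural parameter $$\eta(\mathbf{z},\mathbf{x})=\left(\frac{\sum_i 1\{z_i=k\}x_i}{\sigma^2},\ -\frac{1}{2\tau^2}-\frac{\sum_i 1\{z_i=k\}}{2\sigma^2}\right)'$$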

## Mean-Field Variational Inference

1. Compute the log of conditionals $$\log p(Z_j\mid Z_{-j},\mathbf{x})$$
2. Compute the expectation with respect to $$q(Z_{-j})$$
3. Coordinate ascent update
4. Iterate

Remark:

• The optimal $$q^*(z_j)$$ is in the same exponential family as the conditional
• If we set $$q(\mathbf{z}\mid \mathbf{\nu})=\prod_jq(z_j\mid \nu_j)$$, where $$q(z_j\mid \nu_j)$$ is in the same exponential family as the conditionals, the coordinate ascent algorithm just iteratively sets each $$\nu_j$$ (natural variational parameter) equal to the expectation of the natural conditional parameter for variable $$z_j$$ (recall ML equation for EF): $\nu_j^*=E_{-j}[\eta(Z_{-j}, \mathbf{x})]$
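
• Worked instance (sketch, Gaussian mixture): for $$q(\mu_k)$$ the update sets $\nu_k^*=E_{-k}[\eta(\mathbf{Z},\mathbf{x})]=\left(\frac{\sum_i \varphi_{ik}x_i}{\sigma^2},\ -\frac{1}{2\tau^2}-\frac{\sum_i \varphi_{ik}}{2\sigma^2}\right)',$ where $$\varphi_{ik}=E_q[1\{Z_i=k\}]$$; translating these natural parameters back to a mean and variance gives $$s_k^2=\left(\frac{1}{\tau^2}+\frac{\sum_i\varphi_{ik}}{\sigma^2}\right)^{-1}$$ and $$m_k=s_k^2\,\frac{\sum_i\varphi_{ik}x_i}{\sigma^2}$$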

## Comment

• Required reading:
• Sections 10.1-10.4, Bishop CM (2006), Pattern Recognition and Machine Learning.