October 27, 2016

Variational Inference: Overview

  • Notation: Let \(\mathbf{x}=(x_1, \ldots, x_n)'\) be the observed data and \(\mathbf{Z}=(Z_1, \ldots, Z_m)'\) be unobserved variables (e.g., unobserved cluster membership indicators, model parameters, etc.). If there are other hyperparameters \(\alpha\), we treat them as fixed for now.

  • Problem: Calculate the posterior distribution \[p(\mathbf{Z}\mid \mathbf{x},\alpha) = \frac{p(\mathbf{Z}, \mathbf{x}\mid \alpha)}{\int_{\mathbf{Z}}p(\mathbf{Z}, \mathbf{x}\mid \alpha)\,\mathrm{d}\mathbf{Z}},\] which is hard to compute for complex likelihoods and priors

  • One Solution: Approximate the posterior by a simpler distribution: the member of a computationally feasible family of distributions that is closest to the actual posterior. (How do we pick such a family? Given the family, how do we find the "closest" member?)

Example: Gaussian Mixture

  • Generative Model:
    1. \(\mu_k\sim N(0,\tau^2)\), \(k=1, \ldots, K\).
    2. \(Z_i\sim {\sf Categorical}(\mathbf{\pi})\), \(i=1,\ldots, n\), \(\mathbf{\pi}=(\pi_1, \ldots, \pi_K)'\), \(\pi_k\geq 0\), \(\sum_k{\pi_k}=1\)
    3. \(x_i\sim N(\mu_{Z_i}, \sigma^2)\), \(i=1,\ldots, n\).
  • Given \(\tau\), \(\sigma\), and the data, the posterior of the Gaussian means (\(\mu_k\)) and indicators (\(Z_i\)) is \[p(\mathbf{\mu},\mathbf{Z}\mid \mathbf{x})=\frac{\prod_{k}p(\mu_k)\prod_{i}p(Z_i)p(x_i\mid Z_i, \mathbf{\mu})}{p(\mathbf{x})}\]

  • Denominator (the marginal distribution of the observed data) is a sum of \(K^n\) terms: \(p(\mathbf{x})=\sum_{\mathbf{Z}}\int_{\mathbf{\mu}}\prod_{k}p(\mu_k)\prod_{i}p(Z_i)p(x_i\mid Z_i, \mathbf{\mu})\,\mathrm{d}\mathbf{\mu}\)

  • Use \(q\) with fitted variational parameters \(\hat{\mathbf{\nu}}\) as a proxy for the posterior in predictions, parameter inference, etc. (Caution: mean-field approximations tend to underestimate posterior standard deviations.)
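
For concreteness, here is a minimal simulation of this generative model (a sketch in Python; the values of \(K\), \(n\), \(\tau\), \(\sigma\), and \(\mathbf{\pi}\) below are illustrative choices, not specified in the notes).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (assumed) settings; the notes leave these unspecified.
K, n = 3, 500             # number of components, sample size
tau, sigma = 10.0, 1.0    # prior sd of the means, observation sd
pi = np.full(K, 1.0 / K)  # mixing weights: pi_k >= 0, sum to 1

# 1. mu_k ~ N(0, tau^2)
mu = rng.normal(0.0, tau, size=K)

# 2. Z_i ~ Categorical(pi)
Z = rng.choice(K, size=n, p=pi)

# 3. x_i ~ N(mu_{Z_i}, sigma^2)
x = rng.normal(mu[Z], sigma)
```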

Main idea

  • Pick a family of distributions over the unobserved (latent) variables \(\mathbf{Z}\), indexed by variational parameters \(\mathbf{\nu}\): \[q(Z_1, Z_2, \ldots, Z_m\mid \mathbf{\nu})\]
  • Find the value(s) of \(\mathbf{\nu}\) that best approximates the posterior of interest
  • Remark: we are approximating a distribution given the data \(\mathbf{x}\) at hand: \(p(\mathbf{Z}\mid \mathbf{X}=\mathbf{x})\), not all the conditional distributions \(\{p(\mathbf{Z}\mid \mathbf{X}=\mathbf{x})\}_{\mathbf{x}\in \mathcal{X}}\)

Kullback-Leibler Divergence: A measure of the closeness of two distributions

  • Definition \[KL(q\|p)=E_q\left[\log\frac{q(\mathbf{Z})}{p(\mathbf{Z}\mid \mathbf{x})}\right]\]

  • Nonnegative; equals zero if and only if \(q(\mathbf{z})=p(\mathbf{z}\mid\mathbf{x})\) (almost everywhere)

Evidence Lower Bound for the Log Probability of the Observed Data

\[ \begin{aligned} \log p(\mathbf{x}) & = \log \int_{\mathbf{Z}}p(\mathbf{x},\mathbf{Z}) \mathrm{d} \mathbf{Z} = \log \int_{\mathbf{Z}}p(\mathbf{x},\mathbf{Z})\frac{q(\mathbf{Z})}{q(\mathbf{Z})} \mathrm{d} \mathbf{Z} = \log \left(E_q\left[\frac{p(\mathbf{x},\mathbf{Z})}{q(\mathbf{Z})}\right]\right)\\ & \geq E_q\left[\log \frac{p(\mathbf{x},\mathbf{Z})}{q(\mathbf{Z})}\right] = E_q[\log p(\mathbf{x},\mathbf{Z})]-E_q[\log q(\mathbf{Z})]\overset{\Delta}{=} ELB, \end{aligned} \]

where the inequality follows from Jensen's inequality applied to the concave function \(\log\).
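
A quick numerical check of the bound (a sketch assuming a toy conjugate model \(Z\sim N(0,1)\), \(x\mid Z\sim N(Z,1)\), which is not part of the notes; there the posterior is \(N(x/2, 1/2)\) and \(\log p(x)\) is available in closed form, so ELB \(\leq \log p(x)\) can be verified directly, with equality when \(q\) is the true posterior).

```python
import numpy as np

rng = np.random.default_rng(1)

def normal_logpdf(z, mean, sd):
    return -0.5 * np.log(2 * np.pi * sd**2) - (z - mean) ** 2 / (2 * sd**2)

x = 1.3  # a single observed data point

def elb(q_mean, q_sd, n_samples=200_000):
    """Monte Carlo estimate of E_q[log p(x, Z)] - E_q[log q(Z)]."""
    z = rng.normal(q_mean, q_sd, size=n_samples)
    log_joint = normal_logpdf(z, 0.0, 1.0) + normal_logpdf(x, z, 1.0)
    log_q = normal_logpdf(z, q_mean, q_sd)
    return np.mean(log_joint - log_q)

log_px = normal_logpdf(x, 0.0, np.sqrt(2.0))  # exact log evidence (x ~ N(0, 2) marginally)
print(log_px)                    # about -1.69
print(elb(0.0, 1.0))             # falls below log p(x)
print(elb(x / 2, np.sqrt(0.5)))  # q = true posterior: matches log p(x) up to MC error
```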

KL Divergence between the Approximating Distribution and the True Posterior

\[ \begin{aligned} KL(q(\mathbf{Z})\|p(\mathbf{Z}\mid\mathbf{x})) & = E_q\left[\log \frac{q(\mathbf{Z})}{p(\mathbf{Z}\mid\mathbf{x})}\right]\\ & = E_q[\log q(\mathbf{Z})]-E_q[\log p(\mathbf{Z},\mathbf{x})]+\log p(\mathbf{x})\\ & = -\left\{E_q[\log p(\mathbf{x},\mathbf{Z})]-E_q[\log q(\mathbf{Z})]\right\}+\log p(\mathbf{x})\\ & = -ELB+\text{a term independent of}~q \end{aligned} \]

  • Key: Minimizing the KL divergence over \(q\) amounts to maximizing ELB over \(q\), which is a lower bound for the log probability of the observed data.

Mean Field Variational Inference (Choosing the family of \(q\))

  • Assume \(q(Z_1, \ldots, Z_m)=\prod_{j=1}^mq(Z_j)\); Independence model.
  • The components of \(\mathbf{Z}\) can be grouped, relaxing the assumption to a factorization across groups
  • Typically, this family does not include the true posterior, which usually possesses dependence among the components of \(\mathbf{Z}\)
    • Gaussian mixture: given the data, the \(Z_i\)'s are dependent on each other and on \(\mathbf{\mu}\) in the true posterior (show the DAG)
    • Dependence makes things hard; the mean-field approximation simplifies computation via a component-wise scheme, which is suitable for parallelization (the factorized family for the Gaussian mixture is written out after this list)
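
For the Gaussian-mixture running example, a common mean-field choice (the Gaussian/Categorical parameterization below is a standard assumption rather than something fixed in the notes) is \[q(\mathbf{\mu}, \mathbf{Z}\mid \mathbf{\nu}) = \prod_{k=1}^{K} N(\mu_k\mid m_k, s_k^2)\,\prod_{i=1}^{n} {\sf Categorical}(Z_i\mid \mathbf{\phi}_i), \qquad \mathbf{\nu}=\{(m_k, s_k^2)_{k=1}^K,\ (\mathbf{\phi}_i)_{i=1}^n\},\] where each factor carries its own free variational parameters and no posterior dependence between \(\mathbf{\mu}\) and \(\mathbf{Z}\) is retained.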

Optimizing ELB(\(q\)), or ELB(\(\mathbf{\nu}\)) by Coordinate Ascent

  • Chain rule: \(p(\mathbf{Z},\mathbf{x})=p(\mathbf{x})\prod_{j=1}^m p(Z_j\mid Z_{1:(j-1)}, \mathbf{x})\)
  • Under the mean-field factorization, \(E_q[\log q(\mathbf{Z})]=\sum_{j=1}^mE_{q(Z_j)}[\log q(Z_j)]\) (the negative entropy of \(q\)); we will use \(E_j\) to denote the expectation taken with respect to the marginal distribution of \(Z_j\) implied by \(q(\mathbf{Z})\).
  • ELB(\(q\)) decomposes as \[ELB(q)=\log p(\mathbf{x})+\sum_j \left\{E_q[\log p(Z_j\mid Z_{1:(j-1)}, \mathbf{x})]-E_j[\log q(Z_j)]\right\}\] (derivation below)
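
The decomposition follows by substituting the chain rule and the factorized entropy into the definition of ELB(\(q\)): \[ELB(q)=E_q[\log p(\mathbf{Z},\mathbf{x})]-E_q[\log q(\mathbf{Z})]=E_q\left[\log p(\mathbf{x})+\sum_j\log p(Z_j\mid Z_{1:(j-1)},\mathbf{x})\right]-\sum_j E_j[\log q(Z_j)],\] and \(E_q[\log p(\mathbf{x})]=\log p(\mathbf{x})\) since \(\log p(\mathbf{x})\) does not involve \(\mathbf{Z}\).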

Coordinate Ascent

  • Recipe: optimize over \(q(Z_j)\), one at a time.
  • For coordinate \(k\), use \(Z_k\) as the last variable in the chain rule (why? so that only the last term of the chain rule depends on \(Z_k\))
  • ELB viewed as a function of \(q(Z_k)\): \[ELB_k=E[\log p(Z_k\mid Z_{-k},\mathbf{x})]-E_k[\log q(Z_k)]+const\]
  • Expand: \[ELB_k=\int_{Z_k} q(Z_k)\, E_{-k}[\log p(Z_k\mid Z_{-k},\mathbf{x})]\, \mathrm{d} Z_k-\int_{Z_k}q(Z_k)\log q(Z_k)\, \mathrm{d} Z_k+const\]
  • Equilibrium: \[\frac{d\, ELB_k}{d\, q(Z_k)} = E_{-k}[\log p(Z_k\mid Z_{-k},\mathbf{x})]-\log q(Z_k)-1=0\]

Coordinate Ascent (continued)

  • Combining the equilibrium condition with a Lagrange multiplier enforcing that the maximizer integrates to one, we obtain the coordinate ascent update for \(q(Z_k)\) \[q^*(z_k) \propto \exp\left\{E_{-k}[\log p(z_k\mid Z_{-k}, \mathbf{x})]\right\},\] which is equivalent to \(q^*(z_k) \propto\exp\left\{E_{-k}[\log p(z_k, Z_{-k}, \mathbf{x})]\right\}\) since \(\log p(z_k\mid Z_{-k},\mathbf{x})\) and \(\log p(z_k, Z_{-k},\mathbf{x})\) differ by a term free of \(z_k\)
  • Iterate over \(k=1,\ldots, m\); at convergence we obtain \(q^*\) as a local maximizer of ELB(\(q\))
  • Use \(q^*\) as the proxy for the true posterior (a coordinate-ascent sketch for the Gaussian mixture follows this list)
  • Connection to Gibbs sampling:
    • GS: sample from full conditionals \(p(Z_k\mid Z_{-k},\mathbf{x})\); iterate
    • Coordinate Ascent: \(q^*(z_k)\propto \exp(E_{-k}[\log p(Z_k\mid Z_{-k}, \mathbf{x})])\)
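
A minimal coordinate-ascent sketch for the Gaussian-mixture running example (Python; it assumes the Gaussian/Categorical mean-field factors written out earlier, with \(\tau\), \(\sigma\), and \(\mathbf{\pi}\) known; the function and variable names are illustrative).

```python
import numpy as np

def cavi_gmm(x, K, tau, sigma, pi, n_iter=100, seed=0):
    """Coordinate ascent for q(mu, Z) = prod_k N(mu_k; m_k, s2_k) prod_i Cat(Z_i; phi_i)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = len(x)
    m = rng.normal(0.0, 1.0, size=K)   # means of q(mu_k)
    s2 = np.ones(K)                    # variances of q(mu_k)
    phi = np.full((n, K), 1.0 / K)     # phi[i, k] = q(Z_i = k)

    for _ in range(n_iter):
        # Update each q(Z_i): phi_ik proportional to pi_k * exp{E[log N(x_i; mu_k, sigma^2)]},
        # keeping only the terms that depend on k.
        logits = np.log(pi) + (np.outer(x, m) - 0.5 * (m**2 + s2)) / sigma**2
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        phi = np.exp(logits)
        phi /= phi.sum(axis=1, keepdims=True)

        # Update each q(mu_k): Gaussian with precision 1/tau^2 + sum_i phi_ik / sigma^2
        # and mean s2_k * sum_i phi_ik x_i / sigma^2.
        s2 = 1.0 / (1.0 / tau**2 + phi.sum(axis=0) / sigma**2)
        m = s2 * (phi * x[:, None]).sum(axis=0) / sigma**2

    return m, s2, phi
```

Running it on the data simulated earlier, e.g. `m, s2, phi = cavi_gmm(x, K, tau, sigma, pi)`, typically recovers the component means up to a relabeling; in practice one also monitors ELB(\(q\)) across iterations as a convergence check.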

Remarks

  • Why this works: in the Expand step, we used the property that \(q\) factorizes across dimensions.
  • The updating formula of \(q^*\): a result of the factorization
  • The optimal \(q^*\) might not be easy to work with, but it is easy for many models (e.g., the conditional Multinomial example sketched below)
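
For instance, in the Gaussian mixture the update for an indicator \(Z_i\) works out to a Categorical (conditional Multinomial) distribution; a sketch using the mean-field factors introduced earlier: \[q^*(z_i=k) \propto \exp\left\{E_{-i}\left[\log \pi_k + \log N(x_i\mid \mu_k, \sigma^2)\right]\right\} \propto \pi_k\exp\left\{\frac{x_i E[\mu_k]-\tfrac{1}{2}E[\mu_k^2]}{\sigma^2}\right\},\] so \(q^*(Z_i)\) is again Categorical, with weights obtained by normalizing the right-hand side over \(k\).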

Exponential Family (EF) Conditionals

  • Suppose each conditional is in the exponential family \[p(z_j\mid z_{-j},\mathbf{x})=h(z_j)\exp\{\eta(z_{-j},\mathbf{x})'t(z_j)-a(\eta(z_{-j},\mathbf{x}))\}\]
  • Rich applications:
    • Kalman filters
    • Hierarchical Hidden Markov Models (HMM)
    • Mixed-membership model of exponential families
    • Bayesian linear regression
    • Bayesian mixtures of exponential families with conjugate priors

Mean-Field Variational Inference

  1. Compute the log of conditionals \(\log p(Z_j\mid Z_{-j},\mathbf{x})\)
  2. Compute the expectation with respect to \(q(Z_{-j})\)
  3. Coordinate ascent update
  4. Iterate

Remark:

  • The optimal \(q^*(z_j)\) is in the same exponential family as the conditional
  • If we set \(q(\mathbf{z}\mid \mathbf{\nu})=\prod_jq(z_j\mid \nu_j)\), where \(q(z_j\mid \nu_j)\) is in the same exponential family as the conditionals, the coordinate ascent algorithm just iteratively sets each \(\nu_j\) (natural variational parameter) equal to the expectation of the natural conditional parameter for variable \(z_j\) (recall ML equation for EF): \[\nu_j^*=E_{-j}[\eta(Z_{-j}, \mathbf{x})]\]
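
A worked instance for the Gaussian-mixture means (a sketch under the running example, with \(\sigma\) and \(\tau\) known): the full conditional of \(\mu_k\) is \[p(\mu_k\mid \mathbf{z},\mathbf{x}) \propto \exp\left\{\frac{\sum_i 1[z_i=k]\,x_i}{\sigma^2}\,\mu_k - \frac{1}{2}\left(\frac{1}{\tau^2}+\frac{\sum_i 1[z_i=k]}{\sigma^2}\right)\mu_k^2\right\},\] an exponential family with sufficient statistic \(t(\mu_k)=(\mu_k, \mu_k^2)'\) and natural parameter \(\eta=\left(\frac{\sum_i 1[z_i=k]x_i}{\sigma^2},\; -\frac{1}{2}\left(\frac{1}{\tau^2}+\frac{\sum_i 1[z_i=k]}{\sigma^2}\right)\right)'\). The update \(\nu_k^*=E_{-k}[\eta]\) (the expectation taken over the indicators under \(q\)) simply replaces each \(1[Z_i=k]\) by \(\phi_{ik}=q(Z_i=k)\), i.e., \(1/s_k^2=1/\tau^2+\sum_i\phi_{ik}/\sigma^2\) and \(m_k=s_k^2\sum_i\phi_{ik}x_i/\sigma^2\), matching the Gaussian updates in the coordinate-ascent sketch above.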

Comment

  • Required reading:
    • Sections 10.1-10.4, Bishop CM (2006), Pattern Recognition and Machine Learning.