September 29, 2016

Lecture 7 Main Points Once Again

  • Marginal probabilities. Compute marginals of variables (given model parameters \(\boldsymbol{\theta}\)): \(p(x_i\mid \boldsymbol{\theta})=\sum_{\mathbf{x}': x_i'=x_i}p(\mathbf{x}'\mid \boldsymbol{\theta})\) (or posterior distributions, a.k.a. query probabilities)

  • Technique: variable elimination, which avoids computational complexity that is exponential in the dimension (a brute-force sketch of this exponential cost follows this list)

  • Why it works
    • Use the fact that some factors only involve a small number of variables
    • By computing intermediate factors and caching the results, we avoid duplicated calculations
  • Q: What if we have already computed a particular query probability \(p(x_1\mid x_6)\), and now we want to compute \(p(x_4\mid x_3)\)? How can we share the work between the two queries?

  • A: This motivates the study of more sophisticated graph representations, including factor graphs and tree representations of undirected graphs (UGs).
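To make the exponential cost referenced above concrete, here is a minimal brute-force sketch of the marginalization formula: it sums the joint table over all configurations consistent with \(x_i\), so its cost grows as \(k^d\) in the number of variables \(d\). The binary-state assumption, the variable count, and the random joint table are illustrative choices of mine, not from the lecture.

```python
import numpy as np

# Brute-force marginalization: p(x_i) = sum over all configurations x' with x'_i = x_i.
# Hypothetical setup: d binary variables, joint stored as a full 2^d table --
# feasible only for small d, which is exactly what variable elimination avoids.
d = 6
rng = np.random.default_rng(0)
joint = rng.random((2,) * d)        # joint[x1, ..., x6], unnormalized
joint /= joint.sum()                # normalize so the entries sum to one

def marginal_brute_force(joint, i):
    """p(x_i): sum the joint over every other variable (O(2^d) work)."""
    other_axes = tuple(ax for ax in range(joint.ndim) if ax != i)
    return joint.sum(axis=other_axes)

print(marginal_brute_force(joint, 0))   # p(x_1) as a length-2 vector
```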

Induced Graphs: Variable Elimination: \(p(x_1, x_6)\)

Example of variable elimination for \(p(x_1, x_6)\)

\[ \begin{aligned} p(x_1,x_6) & = \sum_{x_2,x_3,x_4,x_5}p(x_1)\,p(x_2\mid x_1)\,p(x_3\mid x_2)\,p(x_4\mid x_2)\,p(x_5\mid x_3,x_4)\,p(x_6\mid x_5) \\ &= p(x_1)\sum_{x_2,x_3,x_4}p(x_2\mid x_1)\,p(x_3\mid x_2)\,p(x_4\mid x_2)\underbrace{\left\{\sum_{x_5}p(x_5\mid x_3,x_4)\,p(x_6\mid x_5)\right\}}_{m_5(x_3,x_4,x_6)} \\ &= p(x_1)\sum_{x_2}p(x_2\mid x_1)\sum_{x_3}p(x_3\mid x_2)\sum_{x_4}p(x_4\mid x_2)\,m_5(x_3,x_4,x_6) \end{aligned} \]

  • Inside-out strategy: eliminate variables locally, computing and caching intermediate factors for the remaining eliminations (a numerical sketch follows this list).
  • Because eliminating the variable \(X_5\) means integrating, over every state \(X_5=x_5\), the messages that should be sent to \(X_6\) via the factor \(\psi_{X_5,X_6}\), we multiply that factor by the messages sent into it before summing out \(x_5\).
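Below is a minimal numpy sketch of this elimination order (eliminate \(x_5\), then \(x_4\), \(x_3\), \(x_2\)), assuming binary variables and randomly generated conditional probability tables; the table names and the brute-force check are mine, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_cpt(*shape):
    """Random conditional table, normalized over its last axis."""
    t = rng.random(shape)
    return t / t.sum(axis=-1, keepdims=True)

p1 = random_cpt(2)        # p(x1)
p2 = random_cpt(2, 2)     # p(x2 | x1), indexed [x1, x2]
p3 = random_cpt(2, 2)     # p(x3 | x2), indexed [x2, x3]
p4 = random_cpt(2, 2)     # p(x4 | x2), indexed [x2, x4]
p5 = random_cpt(2, 2, 2)  # p(x5 | x3, x4), indexed [x3, x4, x5]
p6 = random_cpt(2, 2)     # p(x6 | x5), indexed [x5, x6]

# Variable elimination, inside out, caching each intermediate factor.
m5 = np.einsum('ijk,kl->ijl', p5, p6)   # m5[x3, x4, x6] = sum_x5 p5 * p6
m4 = np.einsum('ij,kjl->ikl', p4, m5)   # m4[x2, x3, x6] = sum_x4 p4 * m5
m3 = np.einsum('ik,ikl->il', p3, m4)    # m3[x2, x6]     = sum_x3 p3 * m4
m2 = np.einsum('ai,il->al', p2, m3)     # m2[x1, x6]     = sum_x2 p2 * m3
p16 = p1[:, None] * m2                  # p(x1, x6)

# Sanity check against brute-force summation over the full joint table.
joint = np.einsum('a,ab,bc,bd,cde,ef->abcdef', p1, p2, p3, p4, p5, p6)
assert np.allclose(p16, joint.sum(axis=(1, 2, 3, 4)))
```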

Factor Graphs

  • We've shown how to use variable elimination to efficiently compute marginal probabilities on DAGs
  • We now extend this to undirected graphs using factor graphs, which generalize both DAGs and UGs.
  • DAGs and UGs focus on capturing the conditional independencies of the probability distribution through (d-)separation relations.
  • Factor graphs aim to capture the factorization scheme of a multivariate joint distribution. (That is, we focus on the exponential representation of the distribution.)

Examples: Factor Graphs

  • The factor graph representation is not unique given only the graph structure (see the example below).
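A standard illustration of this point (the concrete factors are mine, not from the slides): the complete undirected graph on three variables is consistent with at least two different factor graphs,

\[ p(x_1,x_2,x_3)\;\propto\; f(x_1,x_2,x_3) \qquad\text{versus}\qquad p(x_1,x_2,x_3)\;\propto\; f_a(x_1,x_2)\,f_b(x_2,x_3)\,f_c(x_1,x_3), \]

one with a single factor node connected to all three variables, the other with three pairwise factor nodes. Both induce the same undirected graph, so the factorization scheme cannot be recovered from the graph structure alone.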

Sum-Product Algorithms for Factor Trees

  • Tree-structured factor graph: the undirected graph obtained by ignoring the distinction between variable nodes and factor nodes is a tree.
  • On these graphs, exact inference is possible.
  • If loops exist in the underlying undirected graph, the belief propagation algorithm must initialize the messages and then iterate the message-update rules. The iterations may or may not converge to stable message tables and beliefs, and the converged values are at best approximations to the exact marginal probabilities.

Sum-Product Algorithm for Factor Trees

Let \(N(x)\) and \(N(f)\) denote the neighbors of a variable node and a factor node, respectively. The marginal probability of \(X_i\) can be evaluated as \(p(x_i)=\prod_{f_C\in N(x_i)} m_{f_C\rightarrow x_i}(x_i)\). A runnable sketch of the four rules follows the list below.

  1. If \(f_C\) is a leaf node, then \(m_{f_C\rightarrow x}(x)=f_C(x)\) (its scope is just \(\{x\}\))
  2. If \(x\) is a leaf node, then \(m_{x\rightarrow f_C}(x)=1\)
  3. The message sent from a non-leaf factor node \(f_C\) to a variable node \(X_i\) is \[m_{f_C\rightarrow x_i}(x_i)=\sum_{x_{N(f_C)-x_i}}f_C(x_C)\prod_{x'\in N(f_C)-x_i} m_{x'\rightarrow f_C}(x')\]
  4. The message sent from a non-leaf variable node \(X_i\) to a factor node \(f_C\) is: \[m_{x_i\rightarrow f_C}(x_i) =\prod_{f_{C'} \in N(x_i)-f_C} m_{f_{C'}\rightarrow x_i}(x_i)\]
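The following is a minimal Python sketch of Rules 1–4 for discrete variables on a tree-structured factor graph. The representation (a dict mapping each factor name to its variable list and numpy table, plus a dict of variable cardinalities) and all function names are my own, not notation from the lecture; on a tree the recursion below terminates and follows the four rules.

```python
import numpy as np

# factors: {factor_name: (list of variable names, numpy table over those variables)}
# card:    {variable name: number of states}

def msg_f_to_x(factors, card, f, x):
    """Rules 1 and 3: message from factor node f to variable node x."""
    scope, table = factors[f]
    msg = np.array(table, dtype=float)
    for ax, v in enumerate(scope):
        if v == x:
            continue
        m = msg_x_to_f(factors, card, v, f)       # incoming variable-to-factor message
        shape = [1] * msg.ndim
        shape[ax] = card[v]
        msg = msg * m.reshape(shape)              # multiply m onto the axis of v
    axes = tuple(ax for ax, v in enumerate(scope) if v != x)
    return msg.sum(axis=axes) if axes else msg    # sum out everything except x

def msg_x_to_f(factors, card, x, f):
    """Rules 2 and 4: message from variable node x to factor node f."""
    msg = np.ones(card[x])                        # Rule 2 if x has no other neighbors
    for g, (scope, _) in factors.items():
        if g != f and x in scope:
            msg = msg * msg_f_to_x(factors, card, g, x)
    return msg

def marginal(factors, card, x):
    """p(x): product of all incoming factor-to-variable messages, normalized."""
    belief = np.ones(card[x])
    for g, (scope, _) in factors.items():
        if x in scope:
            belief = belief * msg_f_to_x(factors, card, g, x)
    return belief / belief.sum()
```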

Comment on the Sum-Product Algorithm

Suppose each \(X_i\in \{1,\ldots, k\}\), and that there are \(d\) variable nodes and \(M\) factor nodes.

  • Intermediate "factor-to-variable" message table: \(m_{f_C\rightarrow x_i}(x_i)\), computed from at most \(k^{v_i}\) multiplications and \(k^{|N(f_C)|-1}-1\) additions, where \(v_i=|X_C|+|N(f_C)|-1\) is the number of variables involved. We have at most \(d\times M\) such messages.

  • The "variable-to-factor" message table \(m_{x_i\rightarrow f_C}(x_i)\) is computed from at most \(k(|N(x_i)|-1)\) multiplications. We have at most \(d\times M\) such messages.

  • Each marginal probability is then computed by one final multiplication of the cached message tables (computed by Rules 3 and 4)

  • Intermediate message tables make it easy to share computational work across queries (a small worked example follows this list)
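As a small worked illustration (the numbers are chosen by me, not taken from the slides): with binary variables (\(k=2\)) and a factor \(f_C\) whose neighbors are \(\{x_i,x_j,x_l\}\) (so \(|N(f_C)|=3\)), the factor-to-variable message table has only \(k=2\) entries,

\[ m_{f_C\rightarrow x_i}(x_i)=\sum_{x_j}\sum_{x_l} f_C(x_i,x_j,x_l)\, m_{x_j\rightarrow f_C}(x_j)\, m_{x_l\rightarrow f_C}(x_l), \qquad x_i\in\{1,2\}, \]

and each entry is a sum of \(k^{|N(f_C)|-1}=4\) products, i.e. \(k^{|N(f_C)|-1}-1=3\) additions per entry. The resulting two-entry table is cached and reused by any later query that needs this message.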

Example: Sum-Product Algorithm

  • Calculate the marginal distribution \(p(x_1)\) using the 4 rules on the previous slide: \[p(x_1)=\sum_{x_2}\cdots\sum_{x_6}p(\mathbf{x})\] (a usage sketch on a smaller stand-in graph follows below)
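The six-variable factor tree for this example is given as a figure in the slides. As a stand-in, here is how the sum-product sketch above (its `marginal(...)` helper) would be used on a small hypothetical chain \(x_1 - f_a - x_2 - f_b - x_3\) with binary variables, checked against brute-force summation.

```python
import numpy as np

rng = np.random.default_rng(0)
fa = rng.random((2, 2))                  # f_a(x1, x2), arbitrary nonnegative factor
fb = rng.random((2, 2))                  # f_b(x2, x3)

factors = {"fa": (["x1", "x2"], fa), "fb": (["x2", "x3"], fb)}
card = {"x1": 2, "x2": 2, "x3": 2}

p_x1 = marginal(factors, card, "x1")     # sum-product result (sketch above)

joint = np.einsum("ab,bc->abc", fa, fb)  # brute force over the full joint table
brute = joint.sum(axis=(1, 2))
brute /= brute.sum()
assert np.allclose(p_x1, brute)
```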

Extensions

  • What if the factor graph is not a tree: the junction-tree algorithm, which clusters maximal cliques into supernodes that then form a tree
  • What if we care about the configuration with the maximum probability: the Max-Product algorithm, or Max-Sum after taking logarithms of the probabilities (the modified message rule is sketched below)
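For reference, here is the standard formulation of the change (not copied verbatim from the slides): the only modification to the factor-to-variable message of Rule 3, and to the final readout, is that the sum is replaced by a maximization,

\[ m^{\max}_{f_C\rightarrow x_i}(x_i)=\max_{x_{N(f_C)-x_i}}\; f_C(x_C)\prod_{x'\in N(f_C)-x_i} m^{\max}_{x'\rightarrow f_C}(x'). \]

Taking logarithms turns the products into sums (Max-Sum), which is numerically more stable, and recording the maximizing assignments at each step (back-pointers) recovers the most probable configuration.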

Comment

  • Next Lecture: Belief propagation demonstration on Hidden Markov Models
    • Forward-backward algorithm (special case of Sum-Product algorithm)
    • Viterbi algorithm (special case of Max-Product or Max-Sum algorithm)
  • Optional Reading: Yedidia, J.S., Freeman, W.T. and Weiss, Y., 2003. Understanding belief propagation and its generalizations. Exploring artificial intelligence in the new millennium, 8, pp.236-239.
    • Belief propagation is introduced via pairwise Markov networks: any factor graph can be transformed into a pairwise Markov network, and vice versa.
    • For graphs that are not singly connected, we can trade off the accuracy of the approximate marginal probabilities against computational efficiency. Pairwise belief propagation (the Bethe approximation) is fast but inaccurate when the graph has short loops; larger-cluster-based belief propagation (the Kikuchi approximation) is more accurate but more computationally intensive because of its larger intermediate tables.