September 27, 2016

## Course Logistics

• Office hour: 2-3pm on Tuesdays or by appointment
• New office: 4623 SPH-I (within Suite 4605)
• Homework 1 due 11:59pm on October 10, 2016 (Questions?)

## Inference Problems: Three Types of Tasks

• Marginal probabilities. Compute marginals of variables (given model parameters $$\mathbf{\theta}$$): $$p(x_i\mid \mathbf{\theta})=\sum_{\mathbf{x}': x_i'=x_i}p(\mathbf{x}'\mid \mathbf{\theta}).$$ (Or conditional probabilities: posterior distribution)

• Partition function. For a Gibbs distribution representable as normalized products of factors: $$p(\mathbf{x})=\frac{1}{Z}\prod_{C\in\mathcal{C}}\psi_{C}(\mathbf{x}_C)$$ with respect to a graph $$\mathcal{G}$$ with cliques $$\mathcal{C}$$, compute $Z(\mathbf{\theta})=\sum_{\mathbf{x}}\prod_{C\in\mathcal{C}}\psi_{C}(\mathbf{x}_C)$

• Maximum A Posteriori (MAP) Inference. Compute the variable configuration with the highest probability: $$\hat{\mathbf{x}}=\arg \max_{\mathbf{x}}p(\mathbf{x}\mid \mathbf{\theta}),$$ where $$p$$ is generic notation for a joint or conditional distribution over the unobserved variables $$\mathbf{X}$$.
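All three tasks can be carried out by brute-force enumeration on a small model. A sketch on a made-up three-variable Gibbs distribution (the factor tables below are illustrative, not from the lecture):

```python
import itertools

# Toy Gibbs distribution over three binary variables with two pairwise
# factors (illustrative values): p(x1, x2, x3) ∝ psi_a(x1, x2) * psi_b(x2, x3)
psi_a = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}
psi_b = {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}

def unnormalized(x):
    x1, x2, x3 = x
    return psi_a[(x1, x2)] * psi_b[(x2, x3)]

configs = list(itertools.product([0, 1], repeat=3))

# Partition function: sum of unnormalized weights over all configurations.
Z = sum(unnormalized(x) for x in configs)

# Marginal p(X1 = 0): sum over all configurations with x1 fixed to 0.
p_x1_0 = sum(unnormalized(x) for x in configs if x[0] == 0) / Z

# MAP: the configuration with the highest (un)normalized probability.
x_map = max(configs, key=unnormalized)

print(Z, p_x1_0, x_map)
```

Enumeration touches all $$k^d$$ configurations, which motivates the more efficient schemes below.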

## Compute Marginal Probabilities

• Let $$\mathbf{X}=\{X_1,\ldots,X_d\}$$ denote the set of variable nodes, each of which can range from $$1$$ to $$k$$; $$F=\{\psi_{\alpha_1},\ldots, \psi_{\alpha_s}\}$$ denote the set of factor nodes, where $$\alpha_1, \ldots, \alpha_s \subset \{1,\ldots, d\}$$.
• Computing the marginal probability naively: $p(X_1=x_1)=\frac{1}{Z}\sum_{x_2,\ldots, x_d}\prod_{i=1}^s \psi_{\alpha_i}(\mathbf{x}_{\alpha_i})$

• Computational complexity: $$\mathcal{O}(k^{d-1})$$ additions of values read from the probability table, one row for each configuration $$X_2=x_2,\ldots,X_d=x_d$$.

• Instead, we choose an ordering $$\mathcal{I}$$ among $$\{X_1,\ldots,X_d\}$$ and utilize the factorization of $$p(\mathbf{x})$$. (Variable Elimination)
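The saving from exploiting the factorization can be seen on a hypothetical three-variable chain: by the distributive law, pushing the sum over $$x_3$$ inside the product gives the same answer with far fewer operations. A sketch with made-up pairwise tables:

```python
import itertools

# Hypothetical chain X1 - X2 - X3 with pairwise factors (values made up).
k = 3
psi12 = [[i + j + 1.0 for j in range(k)] for i in range(k)]
psi23 = [[2.0 * i + j + 1.0 for j in range(k)] for i in range(k)]

x1 = 0

# Naive: O(k^{d-1}) terms, one per configuration of (x2, x3).
naive = sum(psi12[x1][x2] * psi23[x2][x3]
            for x2, x3 in itertools.product(range(k), repeat=2))

# Pushing the inner sum in: sum_{x2} psi12(x1,x2) * (sum_{x3} psi23(x2,x3)).
m3 = [sum(psi23[x2][x3] for x3 in range(k)) for x2 in range(k)]  # eliminate X3
pushed = sum(psi12[x1][x2] * m3[x2] for x2 in range(k))          # eliminate X2

print(naive, pushed)  # the two agree
```

On a chain this replaces $$k^2$$ product terms with two sums of $$k$$ terms each; variable elimination generalizes this idea to arbitrary factor graphs.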

## Factor Graph

• Explicitly encodes the relations among subsets of variables, which is sometimes not possible with the clique potential parameterization (see the example on the next slide).
• Supports the inference/computational techniques
• Unifies directed and undirected graphical models
• Example (whiteboard)

## Factor Graph: A Refinement

• Example:
• $$\mathcal{H}=(V,E)$$: complete (fully connected) undirected graph with vertices $$V$$ and edges $$E$$. $$|V|=d$$
• $$P$$: positive distribution over $$d$$ binary nodes; has only pairwise potentials (pairwise dependencies cannot be conditioned away).
• $$P$$ is Markov to $$\mathcal{H}$$, but not Markov to any graph missing even one edge.
• The graph $$\mathcal{H}$$ can only guide us to use the maximal clique potential parametrization. For a complete graph, that is a single, large potential $$\psi(X_1,\ldots,X_d)$$, which requires $$2^d$$ parameters.
• However, $$P$$ by definition can be represented as a Gibbs distribution with only pairwise potentials $$\psi_{{j}{j'}}(X_j,X_{j'})$$, which requires $$4{d \choose 2}$$ parameters.
• When $$d=10$$, $$2^d=1,024$$, much bigger than $$4{d \choose 2}=180$$.
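The parameter counts in the example can be checked directly:

```python
from math import comb

# Parameter counts from the example: one table entry per joint configuration
# of d binary variables vs. 4 entries per pairwise potential on the
# complete graph.
d = 10
full_table = 2 ** d        # single maximal clique potential
pairwise = 4 * comb(d, 2)  # pairwise potentials only

print(full_table, pairwise)  # 1024 vs 180
```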

## Variable Elimination: Computing $$p(X_1=x_1)$$ from $$p(X_1,\ldots, X_d)$$

1. Initialize an active list of potentials as read off from the factor graph.
2. Choose an ordering $$\mathcal{I}$$ for the variables in which $$x_1$$ appears last.
3. For each $$i=1,\ldots,d-1$$, let $$\mathcal{I}_i=j$$, do the following:
• Find all potentials in the currently active list that reference $$x_j$$ and remove them from the active list
• Let $$\phi_{j}(X_{T_j})$$ denote the product of these potentials, with $$T_j$$ being the union of all nodes appearing in the potentials that reference $$X_j$$.
• Calculate $$\tau_i(x_{T_j-X_j})=\sum_{x_j}\phi_{j}(x_{T_j})$$
• Place $$\tau_i(x_{T_j-X_j})$$ on the active list as one potential function
4. Finally, we obtain $$p(x_1)=\frac{1}{Z}\tau_{d-1}(x_1)$$
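A minimal sketch of steps 1–4, assuming factors are stored as explicit tables over small discrete variables; the chain example and its values are illustrative:

```python
import itertools

# A factor is (scope, table): `scope` is a tuple of variable indices and
# `table` maps assignments (tuples of values) to nonnegative reals.

def eliminate(factors, order, k):
    """Sum out the variables in `order`; return (remaining vars, table)."""
    active = list(factors)
    for j in order:
        touching = [f for f in active if j in f[0]]   # potentials referencing x_j
        active = [f for f in active if j not in f[0]]
        scope = sorted(set().union(*(set(f[0]) for f in touching)))  # T_j
        new_scope = tuple(v for v in scope if v != j)                # T_j minus x_j
        table = {}
        for assign in itertools.product(range(k), repeat=len(new_scope)):
            env = dict(zip(new_scope, assign))
            total = 0.0
            for xj in range(k):                       # sum over x_j
                env[j] = xj
                prod = 1.0
                for s, t in touching:                 # product phi_j
                    prod *= t[tuple(env[v] for v in s)]
                total += prod
            table[assign] = total
        active.append((new_scope, table))             # put tau on the active list
    # Multiply the remaining factors over the surviving variables.
    remaining = sorted(set().union(*(set(f[0]) for f in active)))
    result = {}
    for assign in itertools.product(range(k), repeat=len(remaining)):
        env = dict(zip(remaining, assign))
        result[assign] = 1.0
        for s, t in active:
            result[assign] *= t[tuple(env[v] for v in s)]
    return remaining, result

# Chain X1 - X2 - X3 (indices 1, 2, 3), binary, with made-up tables.
f12 = ((1, 2), {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0})
f23 = ((2, 3), {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0})

# Eliminate X3 then X2, leaving tau_{d-1}(x1); then p(x1) = tau(x1) / Z.
vars_left, tau = eliminate([f12, f23], order=[3, 2], k=2)
Z = sum(tau.values())
print({x: v / Z for x, v in tau.items()})
```

Each intermediate table $$\tau$$ is defined over $$T_j$$ minus the eliminated variable, so its size is what drives the complexity bound discussed below.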

## Example: Variable Elimination for Factor Graph

• Let $$X_i\in \{1,\ldots,k\}$$.
• Describe how you would compute $$p(X_1=x_1)$$.
• Computational complexity of variable elimination: $$\mathcal{O}((d-1)k^r)$$, where $$d$$ is the number of variables, $$k$$ is the maximum value a variable can take and $$r$$ is the number of variables participating in the largest intermediate "factor".
• Finding a good ordering can reduce the computational complexity.

## Comment

• Next Lecture: Belief propagation (Sum-Product Algorithm on Polytrees)

• Required reading for the week: Chapter 9. Koller and Friedman (2009)