Mathematics of Deep Learning: Lecture 8 – Hierarchical Generative Models for Deep Learning

Transcribed by Frederic Koehler (edited by Asad Lodhia, Elchanan Mossel and Matthew Brennan)

Today we will study a model interpolating between tree reconstruction (introduced last time) and deep learning (topic of this class).

Recall the model last time: we have a broadcasting process on a tree \(T\) with root \(r\) and \(\ell\) levels, where \(k\) samples, each of the form \(\{\sigma'(v)\}_{v \in \text{Nodes}(T)}\) are generated by the following process:

  1. Sample \(\sigma'(r)\) for the root \(r\) uniformly from \([q] = \{1, \ldots, q\}\).
  2. Given \(\sigma'(w)\) and \(v\) is a child of \(w\), then with probability \(\theta\) set \(\sigma'(v) = \sigma'(w)\) and with probability \(1 – \theta\) draw \(\sigma'(v)\) from the uniform distribution on \([q]\).

We let \(\sigma(v)\) be the \(k\)-tuple of the \(k\) i.i.d. samples of \(\sigma'(v)\). Last time we discussed how you can reconstruct the structure of \(T\) given \(\{\sigma(v)\}_{v \in \text{Leaves}(T)}\).

A Model Between Tree Reconstruction and Deep Learning

We will now introduce two models that build on one another and interpolate between tree reconstruction and deep learning.

First Step: Semi-supervised Learning Model

The first step in our interpolation is to convert the problem into one of semi-supervised learning. We attach to each vertex \(v\) in the graph a set of labels \(L(v)\), with the constraint that is \(v\) is above \(w\) then \(L(v) \subset L(w)\), i.e. we only accumulate labels as we go down the tree. Furthermore we require that if \(a \in L(v_1) \cap L(v_2)\) then the lowest common ancestor \(v\) of \(v_1\) and \(v_2\) satisfies \(a \in L(v)\).

The motivation for this comes from evolution: we should think that the labels consist of tags like “Dog” and “German Shepherd” (in which case the \(\sigma(v)\) at the leaves model the genetic data of each species), and the rules give that every descendant of a dog is a dog and that every two dogs most common ancestor was also a dog.

Now that we have labels we can consider the following semi-supervised learning task: the data is \(\{\sigma(v)\}_{v \in Level(\ell)}\) and \(\{(\sigma(v), L(v)) : v \in S\}\) for some subset \(S\) of the leaves (\(Level(\ell)\)). Our task is to infer \(L(v)\) for all \(v \in Level(\ell)\). Roughly speaking this models the problem that we have the genetic code for many animals, some of which are labeled by tags such as “Dog” and “Cat” etc. and now we want to infer for all of the unlabelled animals whether they are a dog or cat as well.

Second Step: Adding Nonlinearity

Now we move on to the second step: we want to introduce nonlinear functions of multiple variables (like the sigmoid composed with linear map in the neurons of a neural network). We modify the way \(\sigma(v)\) is generated given its parent \(\sigma(w)\):

  1. For \(i = 1, \ldots, k\), let \(\tilde{\sigma}(v)_i\) be \(\sigma(w)_i\) with probability \(\theta\) and uniform at random from \([q]\) with probability \(1 – \theta\).
  2. Define \[(\sigma(v)_{2i – 1}, \sigma(v)_{2i}) := \Pi_{(w, v)} \left(\tilde{\sigma}(v)_{\Sigma_{|v|}(2i – 1)}, \tilde{\sigma}(v)_{\Sigma_{|v|}(2i)}\right)\] where \(\Pi_{(w,v)} : [q]^2 \to [q]^2 \) is a function depending only on \(w\) and \(v\), and \(\Sigma_{|v|}\) is a permutation on \([k]\) depending only on the level of \(v\). (Note in particular that neither of these depend on \(i\)).

Thus the variable \(\sigma_{i}(v)\) is a function of noisy copies of two different entries of \(\sigma(w)\). The ability to choose \(\Sigma\) gives us freedom in which of these two entries they are, for example \(\sigma_i\) could depend on one of positions \(i\) and \(i + 2^r\) in its parent and this roughly models the way convolutional neural networks mix across different length scales in each layer.

Also note that applying just one \(\Pi \in S_{q^2}\) is already interesting; i.e. how can we recover the root from the leaves in a known tree? For example if \(\Pi\) is drawn uniformly at random, and only at the last layer, and \(k = 1\) then by symmetry \(\sigma(r)\) is independent from the values of \(\sigma\) at the leaves which is markedly different from the situation without \(\Pi\).

Roughly the analogy to a standard neural network is as follows: \(\Sigma\) represents the connections of neurons, \(\Pi\) is the nonlinear activation composed with a linear function and the noise is a “nuisance variable” that makes objects different (instead of them just being copies of each other). For example when \(\Sigma = Id, \Pi = Id\) and there is no noise, the network is just copying. We can also visualize a sort of “deep net” of a leaf where we look at the interchanges of the variables (given by \(\Sigma\)) between \(\sigma\) on the path from the root to the leaf.

Learning with a Deep Algorithm

We now move on to discussing learning tasks.

The Semi-supervised Learning Task

Formally our goal is to solve the semi-supervised learning problem explained above. Practically, in order to solve this problem we will try to learn hidden variables \(\Sigma_{|v|}\), \(\Pi_{w,v}\) and \(\sigma(v)\) (for \(v\) non-leaves) as well as the tree structure, and then use all of this information to determine labels. We begin with an easy claim.

To identify the labels of the leaves in \(Level(\ell) \setminus S\) correctly, then it must hold that for every \(a\) that appears in the leaves, there exists some \(u, v \in S\) such that \(a \in L(u) \cap L(v)\) and \(MCA(\{\text{Leaves labeled by } a\}) = MCA(\{u,v\})\) where \(MCA\) denotes the most common ancestor. We refer to this as a good labeling .

For example, if we only have one labelled example of a tiger there is no way to infer that some unlabeled examples are tigers, since there is no way to even figure out if the direct parent of the labeled example is also labeled as a tiger.

Observe that this is true even if we are given the tree, and being given the tree \(\sigma\) tells us nothing about the label assignments e.g. the label assignments and \(\sigma\) could have been both generated independently at random.

On the other hand, if the tree is known and we have a good labeling then it is obvious that we can infer all of the correct labels: first propagate the labels up the tree from \(S\) using the good labeling condition, then propagate this information down to the leaves (since labels are hereditary).

Recovering the Tree

What remains to discuss is the problem of recovering the tree. We will not attempt to use the labels for this task, so this part is an unsupervised learning task.

First we assume \(\Sigma\) is known. We consider the task of learning siblings.

Learning Siblings: First we consider how to test whether \(u,v\) are siblings with parent \(w\) given that \(\Pi_{w,u} = \Pi_{w,v} = Id\). Suppose this is the case, then if we let \[X := \# \{ i \in [k] : \sigma(v)_i = \sigma(u)_i \}\] we observe that there is a \(\theta^2\) chance of a given coordinate being “copied” from \(v\) to \(u\) and otherwise a \(1/q\) chance of the coordinate being the same, thus \[\mathbb{E}[X] = k \theta^2 + \frac{k(1 – \theta^2)}{q}\] and by the Chernoff bound this is very concentrated (sub-Gaussian with variance proxy of order \(k\)). Moreover if \(w\) and \(v\) are not siblings then the expectation is smaller, i.e. the first term is at least as bad as \(k \theta^4\) so for sufficiently large \(k\) we can distinguish these two cases with high probability.

In general, we can do this test by enumerating over all possible \(\Pi_{w,u}\) and \(\Pi_{w,v}\): if \(w\) and \(v\) are not siblings, then with high probability we will never observe a small \(X\), whereas if they are then when we choose the right permutations we will observe a small \(X\) with high probability.

Note that it is actually impossible to recover \(\Pi_{w,v}\) and \(\Pi_{w,u}\). If we take some permutation \(\pi\) and replace \(\Pi_{w,v}\) with \(\Pi_{w,v} \circ \pi\) and \(\Pi_{w,u}\) with \(\Pi_{w,u} \circ \pi\) then these new permutations will be equally good for the purposes of our statistical test above, and in fact if we make this change for a whole level we can see that the distribution of the generated data is identical, so the permutations \(\Pi\) are undeterdetermined. A technical argument can be used to show that this is the only degree of freedom for the \(\Pi\) and so we can recover up to this degree of freedom.

Recursive Algorithm: Now that we can learn siblings, we can recurse similarly to the case of tree reconstruction. Thus we can recover the tree and under the condition of a good labeling we can perform semi-supervised learning with \(k = \theta(\ell)\).

We end by noting that if \(S\) is very large then there are other ways to learn labels without reconstructing the tree; e.g. if \(S\) consists of all but a few labels, then we can learn labels of unlabeled leaves just by finding their nearest-neighbor among the elements of \(S\).

Lower Bounds

Note that in the above result we used a recursive (“deep”) algorithm. This leads us to this natural question: can we show that simpler algorithms don’t perform as well?

Local Algorithms

First we will show that local algorithms require a suboptimal number of samples. The definition of a local algorithm is that when it labels \(w \notin S\), it uses only \(\{(\sigma(v), L(v)) : v \in S\}\) and \(\sigma(w)\); it does not use any of the other unlabelled data.

We choose the following input distribution over labels: Fix an integer \(h\) and \(h_0 < h\). There are no labels for the first \(h_0 - 1\) levels. At level \(h_0\), we label each vertex distinctly, in a random order. For each vertex \(v\) at level \(h_0\), we pick its leftmost and rightmost descendant (i.e. leftmost leaf of the overall tree under \(v\) and righmost leaf of the tree under \(v\)) and add these two vertices to \(S\).

The probability that a local algorithm labels a random leaf correctly is less than \(d^{-h_0}(1 + O(k \theta^{h – h_0}))\).
Note that it is trivial to achieve \(d^{-h_0}\) by guessing randomly. As a consequence, to get better than random even when \(h_0 = 1\) we need \(k \theta^{h – 1} = \Omega(1)\) so we need \(k = \Omega(\theta^{-(h – 1)}) = \Omega(|T|^{\beta})\) for some constant \(\beta\) depending on \(\theta\). In comparison the “deep” algorithm we described before needs only \(k = \Omega(h) = \Omega(\log |T|)\).
[Sketch] The little cheat we make is that if \(w\) is in the same tree with labeled leaves \(v_1, v_2\) then we assume \(MCA(w,v_1) = MCA(w, v_2) = MCA(v_1, v_2)\) (It is in fact true that most \(v_1,v_2\) are far apart, which is a general and easy to prove fact about trees that we can use to make the following argument rigorous).

Now the proof “trick” is to give the algorithm extra information, \((Label(v), \sigma(v))\) for each vertex \(v\) at level \(h_0\). Given this extra information, \(\sigma(w)\) is independent of \(\{\sigma(v) : v \in S\}\) by the Markov property (since removing the vertices at level \(h_0\) leaves \(w\) in a connected component by itself, relying on the “cheat” described above). We also make the algorithm’s life “easy” by setting \(\Pi = Id\) and \(\Sigma = Id\).

Now we have reduced the problem to a pure hypothesis testing question. Let \(H_i\) be the hypothesis that \(w\) is labelled by \(i\), i.e. \(w\) is a child of \(v_i\) where \(v_i\) is the vertex at level \(h_0\) labeled by \(i\). Note that the distribution of \(w\) given the true \(v_i\) is that of a noisy copy with parameter \(\theta^{h – h_0}\). Thus with probability at least \(1 – k \theta^{h – h_0}\) all of the coordinates of \(w\) have been resampled from the uniform distribution. By a coupling argument, it follows that any hypothesis test labels correctly with probability at most \[(1 – k\theta^{h – h_0}) (1/d) + k \theta^{h – h_0} \le (1/d) + k\theta^{h – h_0}.\]

Shallow Algorithms

Next we want to show that certain shallow algorithms also cannot recover as well as the “deep” algorithm we presented first. Formally, we define a shallow algorithm to be one that is given access only the following data:

  1. For each label \(a\) and each position \(i\) is given \[c_i(a,x) = \#\{\text{leaves $\nu$ in $S$ labelled by $a$ such that $\sigma(v)_i = x$}\}\]
  2. all unlabeled data

In other words, we have summarized the labeled data by count statistics and do not let the algorithm look at higher order moments of the labelled data.

We consider the family of input instances where the direct children of the root are labeled \(1, \ldots, d\) and in each of their subtrees, every leaf is labeled except the rightmost child. Then we have the following theorem:

If \(d \theta^2 < 1\), a shallow algorithm requires \(k \ge (1/\theta)^c \ge \Omega(|T|^{c'})\) samples in order to label correctly with probability greater than \(2/d\). This is true even if \(\Sigma = Id\) and \(\Pi = Id\).

We do not present the whole proof of the theorem but just observe that it follows from known lower bounds on count reconstruction.

Finally we present an open problem: The lower bounds should be much better when \(\Pi\) are random and the \(\Sigma_j\) mix different elements. More specifically, the bound for shallow algorithms should hold if \(\theta = 1 – c^h\) for some \(c< 1\) ‘for some \(c < 1\), instead of requiring \(\theta < 1/\sqrt{d}\) when \(\Pi\) are random and \(\Sigma\) “mixes” well (e.g. \(\Sigma_r(x) = x + 2^r\)).

Mathematics of Deep Learning: Lecture 7 – Recovering Tree Models

Transcribed by Paxton Turner (edited by Asad Lodhia, Elchanan Mossel and Matthew Brennan)

Tree Reconstruction II

We prove Claim 2 which was left over from last time.

Fix a constant \(\ell \in \mathbb{N}\). Given a \(d\)-ary tree with percolation parameter \(\lambda < 1\) which is sufficiently large, then \[ \mathbb{P}_{\lambda}\left[\text{root is in an} \, \ell-\text{diluted} \, (d^{\ell} - 1) \, \text{-ary open tree}\right] \geq 1 - \epsilon \]
f(p) := \mathbb{P}[\text{Bin}(d^\ell, p) \geq d^{\ell} – 1],
and note that \(f\) is monotone in \(p\). We see that \(f(1) = 1\), and \(f(p) = p^{d^\ell} + d^{\ell} (1-p) p^{d^{\ell} – 1}\). Also
f'(1) = d^\ell + d^\ell(d^\ell – 1 – d^\ell) = 0.
Thus taking \(p^*\) sufficiently close to 1 yields that \(f(p) > p\) for all \(p^* < p < 1\). Also choose \(p^*\) to satisfy \(p^* > 1- \epsilon\). Now choose \(\lambda <1\) such that \[ q = \mathbb{P}[\text{root is connected to all} \, d^\ell \, \text{nodes in the } \, \ell \, \text{level}] \geq p^*/f(p^*). \] Now we apply a recursive argument. Let \[ p_i = \mathbb{P}[\text{root is in an} \, \ell-\text{diluted} \, (d^\ell - 1) \, \text{-ary open tree for} \, i \cdot \ell \, \text{levels}]. \] We will show by induction that \(p_r \geq p^*\) for all \(r\), which implies the lemma. Note that \(p_0 = 1\) and \[ p_{r+1} \geq q \cdot \mathbb{P}\left[\text{Bin}(d^\ell, p_r) \geq d^{\ell} - 1\right] = \frac{p^* f(p_r)}{f(p^*)} \geq p^* \] since \(f\) is monotone and \(p_r \ge p^*\) by the induction hypothesis. The lemma is proved.

Recovering the Tree Structure

We begin with an \(\ell\)-level full binary tree and the size of the alphabet to be \(q = \infty\). We generate \(k\) independent samples of the broadcasting procedure with parameter \( \lambda \) on the tree.

In contrast to the last lecture, today, we do not know the tree structure , i.e. we only receive the colors of the nodes and do not know which nodes are siblings, cousins, etc. Our goal is to determine how large \(k\) must be to recover the tree structure.

In practice, these trees are phylogenetic trees, and the information we have are the DNA/amino acid sequences. Thus, usually \(q = 4\), or \(q = 20\).

Question : How large should \(k, \ell, \lambda\) be so that we recover the tree correctly as \(\ell \to \infty\)?

Recovering Distances Using Correlations

If \(u,v\) are graph distance \(2r\) apart, then
\mathbb{P}[\sigma(u) = \sigma(v)] = \lambda^{2r}.
How large should \(k\) be so that all distances can be computed accurately? Trivially, if \(k \lambda^{2\ell – 2} = 1/2\) and \(d(u,v) = 2 \ell – 2\), then taking a union bound gives
\mathbb{P}[\exists \, \text{sample with} \, \sigma(v) = \sigma(u’)] \leq 1/2.
Then we can’t say if \(d(u,v) = 2 \ell\) or \(d(u,v) = 2 \ell – 2\). If we use concentration, then
k \geq \log(2^\ell) \left(\frac{1}{\lambda^\ell}\right)^2 \log(1/\delta)
allows us to compute distances correctly with probability \(\geq 1 – \delta\). One can show this with Chernoff bounds, and the details are omitted here. If we compute all of the distances between the leaves correctly, then we trivially can recover the tree structure. However, as noted above, this requires at least \(k \geq c (1/\lambda)^{2 \ell} = |T|^\beta\) samples.

Recursive “Deep” Algorithms for \(q = \infty\)

Computing all of the distances between the leaves seems like too strong of a requirement. This raises the question: Is there a deep algorithm that tries to infer the labels at the internal nodes sample by sample that can recover the tree with fewer samples? We will outline such an algorithm when \( 2 \lambda > 1\).

After detecting siblings, we want to understand the probability of being able to recover colors at their parents sample by sample. Consider first a small example. Suppose we have a 2-level binary tree with leaves labeled. If there is a 1 in both the left and right branches, then with probability 1, the root is colored 1. However, if no color appears in both branches, we mark the node with a question mark \(?\). Returning to our original problem, we can first apply this procedure to identify siblings among the leaves and the labels of their parents sample by sample, and then apply this to the nodes at level \( \ell – 1\), and so on.

If \(2 \lambda > 1\), then there exists \(C(\lambda) > 0\) such that
\mathbb{P}[? \, \text{recovered at level} r] \leq 1 – C(\lambda).
Note that an internal node is not a ? if its label survives in its left and right subtrees. Therefore
\mathbb{P}[\text{a node at level } r \, \text{is not} \, ?] &\geq \mathbb{P}[\text{root connected to 2 children}] \\
&= \mathbb{P}[\text{branching process survives}]^2 \\
&= C(\lambda) > 0
if \(2 \lambda > 1\).

How many samples are needed to identify all siblings correctly?
k= \frac{(1 – \lambda)^4}{\lambda} \log \left( \frac{2^\ell}{\delta}\right)
suffices for error \(\leq \delta\). It turns out that there is a similar bound for recovering the tree if \(2 \lambda > 1,\). As observed above, if we purely used distance methods, then we require \(k \geq c (1/\lambda)^{2 \ell} = |T|^\beta\). However, if \(2 \lambda > 1\), then our “deep” algorithm requires only \(k = \log(|T|)\) samples.

If \(2 \lambda < 1\), then what is true? The goal is to show you can't distinguish between the uniform distribution. Heuristically, we must estimate the probability of getting all question marks \(?\) for the lower bound.

Lower bound proof idea: Start with four trees having roots \(a, b, c,\) and \(d\). We want to couple the broadcasting procedure on these trees so that they produce the same data. In such a coupling, we can show
\mathbb{P}[\text{generating different samples}] \leq 4(2 \lambda)^\ell.
\mathbb{P}[\text{coupling succeeds}] \geq 1 – 4k(2 \lambda)^\ell.
Thus if we want recover the tree correctly with probability \(\geq 1/\delta\), we need \(k \geq c(2 \lambda)^\ell = \Omega(|T|^{\ell})\) samples (if \( \lambda < 1\)).

If \(2 \lambda < 1\), distance methods do not give optimal sample complexity.

We make the following general conjecture.

Conjecture : Recursive methods should give optimal complexity.

A Recursive Algorithm for \(q = 2\)

Next we consider a new model. Now we have a full binary tree, and on all edges, we copy with probability \(\eta\). Our alphabet now has size \(q = 2\), and in particular, we set the “letters” to be \(\{-1, 1\}\).

Using distance methods, we cannot distinguish between
\mathbb{E}[\sigma(u) \sigma(v)] = \eta^{2 \ell}, \quad \mathbb{E}[\sigma(v’) \sigma(v)] = \eta^{2 \ell – 2}

The lower bound for distance methods here is \(k \geq c/\eta^{2 \ell}\), and this is more or less tight. As before, we will instead use a recursive approach: first we identify siblings, and then estimate colors of the parents. To identify parents, use the correlations
\mathbb{E}[\sigma(v) \sigma(v’)] = \eta^2
if \(v\) and \(v’\) are siblings. We use correlations to recover distances up to a constant \(i\) levels above. And this requires \(\approx \log(2^\ell)\) samples to concentrate. Then we can use majority vote to recover the color of the parent. By a similar computation to the 2nd moment method from last lecture, if \(2 \eta^2 > 1\),
\mathbb{P}[\text{Maj}(x_1, \ldots, x_{2^r}) = x_{\text{root}}] \geq \frac{1}{2} + \epsilon
for all \(r\). Thus we define
\hat{\sigma}_i = \text{Maj}\left(\sigma_i(u): u \, \text{leaf below} \, v \right).
Note that
\mathbb{E}[\sigma_{\text{root}} \hat{\sigma}_{\text{root}}] = \rho > 0 \Rightarrow \mathbb{E}[\hat{\sigma}_i(v) \hat{\sigma}_i(w)] = \eta^{2d(v,w)} \rho^2.
Also note that \(\rho = \Theta(\epsilon)\).

Results If \(2 \eta^2 > 1\), we can recover the tree with \(k = \log(|T|)\) samples. Otherwise, \(k \geq |T|^\beta\). Compare this to the thresholds \(d \eta^2 > 1\) for count reconstruction and \(d \lambda_2^2 > 1\) for general broadcasting.

Four point method for detecting distances

We define a new twist on the previous model. The parameter \(\eta\) is now non-uniform, but we still require it to satisfy \(2 \eta^2(e) \geq 1 + \epsilon\) for all edges \(e\). In this general set-up, one has to be more careful with detecting distances/siblings.

Four point method : This is a trick from phylogeny. For the purposes of detecting distances, there are 3 possible 2 level binary trees. The neighbors are either \(\{(A, B), (C,D)\}\), \(\{(A, C), (B, D) \}\), and \(\{ (A, D), (B, C) \}\). We use pairwise distances to determine which tree it is. Set
Dist(u,v) = \log\left( \frac{1}{\mathbb{E}[\sigma_u \sigma_v]}\right).

The correct pairing of \(A, B, C\) and \(D\) is the one for which
Dist(X,Y) + Dist(W,V)
is minimal, where \((X, Y, W, V)\) is some permutation of \((A, B, C, D)\).

Now to determine siblings of \(v\), find all vertices with
|Dist(u,v)| \leq 3 \log(1/\sqrt{2}),
and apply the four point method on these vertices.

One can also run into an issue when estimating the colors of parents. To solve this, use an \(\ell\) level recursive estimator.

If \(2 \eta^2 > 1\), then there exists \(\ell(\eta)\) such that using an \(\ell\)-level recursive majority gives estimator \(\hat{\sigma}\) such that
\mathbb{E}[\sigma \hat{\sigma}] \geq 1/2 + \epsilon.

In the next lecture, we discuss hierarchical generative models.