## Mathematics of Deep Learning: Lecture 8 – Hierarchical Generative Models for Deep Learning

Transcribed by Frederic Koehler (edited by Asad Lodhia, Elchanan Mossel and Matthew Brennan)

Today we will study a model interpolating between tree reconstruction (introduced last time) and deep learning (topic of this class).

Recall the model from last time: we have a broadcasting process on a tree $$T$$ with root $$r$$ and $$\ell$$ levels, where $$k$$ samples, each of the form $$\{\sigma'(v)\}_{v \in \text{Nodes}(T)}$$, are generated by the following process:

1. Sample $$\sigma'(r)$$ for the root $$r$$ uniformly from $$[q] = \{1, \ldots, q\}$$.
2. Given $$\sigma'(w)$$ for a vertex $$w$$ and a child $$v$$ of $$w$$: with probability $$\theta$$ set $$\sigma'(v) = \sigma'(w)$$, and with probability $$1 - \theta$$ draw $$\sigma'(v)$$ from the uniform distribution on $$[q]$$.

We let $$\sigma(v)$$ be the $$k$$-tuple of the $$k$$ i.i.d. samples of $$\sigma'(v)$$. Last time we discussed how you can reconstruct the structure of $$T$$ given $$\{\sigma(v)\}_{v \in \text{Leaves}(T)}$$.
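As a concrete reference, the broadcasting process can be sketched in a few lines; this is a minimal simulation, and the function name `broadcast` and the list-of-levels representation are our own choices, not from the lecture.

```python
import random

def broadcast(depth, d, q, theta, rng):
    """One sample of the broadcasting process on the full d-ary tree:
    returns a list of levels, each a list of colors in {0, ..., q-1}."""
    levels = [[rng.randrange(q)]]            # root color, uniform on [q]
    for _ in range(depth):
        children = []
        for color in levels[-1]:
            for _ in range(d):
                # each child copies its parent w.p. theta, else resamples
                children.append(color if rng.random() < theta else rng.randrange(q))
        levels.append(children)
    return levels

rng = random.Random(0)
sample = broadcast(depth=3, d=2, q=4, theta=0.8, rng=rng)
```

Running this $$k$$ times (independently) produces the $$k$$-tuples $$\sigma(v)$$.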

# A Model Between Tree Reconstruction and Deep Learning

We will now introduce two models that build on one another and interpolate between tree reconstruction and deep learning.

## First Step: Semi-supervised Learning Model

The first step in our interpolation is to convert the problem into one of semi-supervised learning. We attach to each vertex $$v$$ in the graph a set of labels $$L(v)$$, with the constraint that if $$v$$ is above $$w$$ then $$L(v) \subset L(w)$$, i.e. we only accumulate labels as we go down the tree. Furthermore we require that if $$a \in L(v_1) \cap L(v_2)$$ then the lowest common ancestor $$v$$ of $$v_1$$ and $$v_2$$ satisfies $$a \in L(v)$$.

The motivation for this comes from evolution: we should think that the labels consist of tags like “Dog” and “German Shepherd” (in which case the $$\sigma(v)$$ at the leaves model the genetic data of each species), and the rules give that every descendant of a dog is a dog and that the most common ancestor of any two dogs was also a dog.

Now that we have labels we can consider the following semi-supervised learning task: the data is $$\{\sigma(v)\}_{v \in Level(\ell)}$$ and $$\{(\sigma(v), L(v)) : v \in S\}$$ for some subset $$S$$ of the leaves ($$Level(\ell)$$). Our task is to infer $$L(v)$$ for all $$v \in Level(\ell)$$. Roughly speaking this models the problem that we have the genetic code for many animals, some of which are labeled by tags such as “Dog” and “Cat” etc. and now we want to infer for all of the unlabelled animals whether they are a dog or cat as well.

Now we move on to the second step: we want to introduce nonlinear functions of multiple variables (like the sigmoid composed with linear map in the neurons of a neural network). We modify the way $$\sigma(v)$$ is generated given its parent $$\sigma(w)$$:

1. For $$i = 1, \ldots, k$$, let $$\tilde{\sigma}(v)_i$$ be $$\sigma(w)_i$$ with probability $$\theta$$ and uniform at random from $$[q]$$ with probability $$1 - \theta$$.
2. Define $(\sigma(v)_{2i - 1}, \sigma(v)_{2i}) := \Pi_{(w, v)} \left(\tilde{\sigma}(v)_{\Sigma_{|v|}(2i - 1)}, \tilde{\sigma}(v)_{\Sigma_{|v|}(2i)}\right)$ where $$\Pi_{(w,v)} : [q]^2 \to [q]^2$$ is a function depending only on $$w$$ and $$v$$, and $$\Sigma_{|v|}$$ is a permutation on $$[k]$$ depending only on the level of $$v$$. (Note in particular that neither of these depends on $$i$$.)

Thus the variable $$\sigma_{i}(v)$$ is a function of noisy copies of two different entries of $$\sigma(w)$$. The ability to choose $$\Sigma$$ gives us freedom in which of these two entries they are, for example $$\sigma_i$$ could depend on one of positions $$i$$ and $$i + 2^r$$ in its parent and this roughly models the way convolutional neural networks mix across different length scales in each layer.
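One step of this modified generation process can be sketched as follows; `Sigma` is given as a permutation array, `Pi` as a lookup table on $$[q]^2$$, and all names are illustrative rather than from the lecture.

```python
import random

def child_colors(parent, theta, q, Sigma, Pi, rng):
    """One step of the modified process: noisy copy of the parent string,
    index shuffle by the permutation Sigma, then the pairwise map Pi on
    [q]^2 applied to consecutive pairs (k is assumed even)."""
    k = len(parent)
    noisy = [c if rng.random() < theta else rng.randrange(q) for c in parent]
    out = [0] * k
    for i in range(0, k, 2):
        a, b = noisy[Sigma[i]], noisy[Sigma[i + 1]]
        out[i], out[i + 1] = Pi[(a, b)]
    return out

rng = random.Random(1)
q, k = 3, 8
Sigma = list(range(k))                                     # identity permutation
Pi = {(a, b): (a, b) for a in range(q) for b in range(q)}  # identity pairwise map
# With theta = 1 and identity Sigma, Pi, a child is an exact copy:
child = child_colors([0] * k, theta=1.0, q=q, Sigma=Sigma, Pi=Pi, rng=rng)
```

Choosing a nontrivial `Sigma` (e.g. mixing position $$i$$ with $$i + 2^r$$) and a nontrivial `Pi` recovers the convolutional-style mixing described above.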

Also note that applying just one $$\Pi \in S_{q^2}$$ is already interesting; e.g., how can we recover the root from the leaves in a known tree? For example if $$\Pi$$ is drawn uniformly at random, applied only at the last layer, and $$k = 1$$, then by symmetry $$\sigma(r)$$ is independent of the values of $$\sigma$$ at the leaves, which is markedly different from the situation without $$\Pi$$.

Roughly the analogy to a standard neural network is as follows: $$\Sigma$$ represents the connections of neurons, $$\Pi$$ is the nonlinear activation composed with a linear function and the noise is a “nuisance variable” that makes objects different (instead of them just being copies of each other). For example when $$\Sigma = Id, \Pi = Id$$ and there is no noise, the network is just copying. We can also visualize a sort of “deep net” of a leaf where we look at the interchanges of the variables (given by $$\Sigma$$) between $$\sigma$$ on the path from the root to the leaf.

# Learning with a Deep Algorithm

We now move on to discussing learning tasks.

Formally our goal is to solve the semi-supervised learning problem explained above. Practically, in order to solve this problem we will try to learn hidden variables $$\Sigma_{|v|}$$, $$\Pi_{w,v}$$ and $$\sigma(v)$$ (for $$v$$ non-leaves) as well as the tree structure, and then use all of this information to determine labels. We begin with an easy claim.

To identify the labels of the leaves in $$Level(\ell) \setminus S$$ correctly, it must hold that for every label $$a$$ that appears in the leaves, there exist some $$u, v \in S$$ such that $$a \in L(u) \cap L(v)$$ and $$MCA(\{\text{Leaves labeled by } a\}) = MCA(\{u,v\})$$, where $$MCA$$ denotes the most common ancestor. We refer to this as a good labeling.

For example, if we only have one labelled example of a tiger there is no way to infer that some unlabeled examples are tigers, since there is no way to even figure out if the direct parent of the labeled example is also labeled as a tiger.

Observe that this is true even if we are given the tree, since being given the tree and $$\sigma$$ tells us nothing about the label assignments; e.g. the label assignments and $$\sigma$$ could have been generated independently at random.

On the other hand, if the tree is known and we have a good labeling then it is obvious that we can infer all of the correct labels: first propagate the labels up the tree from $$S$$ using the good labeling condition, then propagate this information down to the leaves (since labels are hereditary).
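This up-then-down propagation can be sketched concretely on a heap-indexed binary tree (root $$1$$, children of node $$i$$ at $$2i$$ and $$2i+1$$); the indexing scheme and helper names are our own, chosen for brevity.

```python
from collections import defaultdict

def mca(u, v):
    """Most common ancestor in a heap-indexed binary tree
    (root = 1, children of node i are 2i and 2i + 1)."""
    while u != v:
        if u > v:
            u //= 2
        else:
            v //= 2
    return u

def infer_labels(depth, labeled):
    """labeled: dict leaf -> label for the leaves in S.  Propagate each
    label up to the MCA of its labeled representatives, then back down
    to every leaf under that ancestor (labels are hereditary)."""
    by_label = defaultdict(list)
    for leaf, lab in labeled.items():
        by_label[lab].append(leaf)
    first, last = 2 ** depth, 2 ** (depth + 1) - 1
    out = defaultdict(set)
    for lab, leaves in by_label.items():
        anc = leaves[0]
        for leaf in leaves[1:]:
            anc = mca(anc, leaf)
        for leaf in range(first, last + 1):
            node = leaf
            while node > anc:       # walk up until at or above anc
                node //= 2
            if node == anc:         # anc lies on the root-to-leaf path
                out[leaf].add(lab)
    return dict(out)

# 2-level tree, leaves 4..7: two labeled dogs on the left, two cats on the right
labels = infer_labels(2, {4: "dog", 5: "dog", 6: "cat", 7: "cat"})
```

Under the good labeling condition, the MCA computed from the labeled representatives equals the true MCA of all leaves carrying that label, which is exactly why this propagation is correct.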

## Recovering the Tree

What remains to discuss is the problem of recovering the tree. We will not attempt to use the labels for this task, so this part is an unsupervised learning task.

First we assume $$\Sigma$$ is known. We consider the task of learning siblings.

Learning Siblings: First we consider how to test whether $$u,v$$ are siblings with parent $$w$$ given that $$\Pi_{w,u} = \Pi_{w,v} = Id$$. Suppose this is the case, and let $X := \# \{ i \in [k] : \sigma(v)_i = \sigma(u)_i \}.$ For a given coordinate there is a $$\theta^2$$ chance that it is “copied” from $$w$$ to both $$u$$ and $$v$$, and otherwise a $$1/q$$ chance of the coordinate being the same, thus $\mathbb{E}[X] = k \theta^2 + \frac{k(1 - \theta^2)}{q}$ and by the Chernoff bound this is very concentrated (sub-Gaussian with variance proxy of order $$k$$). Moreover if $$u$$ and $$v$$ are not siblings then the expectation is smaller, i.e. the first term is at most $$k \theta^4$$, so for sufficiently large $$k$$ we can distinguish these two cases with high probability.
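A quick simulation of this sibling test, with illustrative parameters (the expectations in the comments follow the formulas above; an unrelated pair stands in for the non-sibling case):

```python
import random

rng = random.Random(0)
q, theta, k = 4, 0.9, 2000

def noisy_copy(parent):
    """Copy each coordinate w.p. theta, else resample uniformly from [q]."""
    return [c if rng.random() < theta else rng.randrange(q) for c in parent]

def agreements(x, y):
    """The test statistic X = #{i : x_i = y_i}."""
    return sum(a == b for a, b in zip(x, y))

w = [rng.randrange(q) for _ in range(k)]        # common parent
u, v = noisy_copy(w), noisy_copy(w)             # siblings
w2 = [rng.randrange(q) for _ in range(k)]
z = noisy_copy(w2)                              # vertex unrelated to u

X_sib, X_far = agreements(u, v), agreements(u, z)
# E[X_sib] = k(theta^2 + (1 - theta^2)/q) ≈ 0.86 k; E[X_far] = k/q = 0.25 k
```

The gap between the two expectations is of order $$k$$ while the fluctuations are of order $$\sqrt{k}$$, which is the concentration argument above in action.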

In general, we can do this test by enumerating over all possible $$\Pi_{w,u}$$ and $$\Pi_{w,v}$$: if $$u$$ and $$v$$ are not siblings, then with high probability we will never observe a large $$X$$, whereas if they are then when we choose the right permutations we will observe a large $$X$$ with high probability.

Note that it is actually impossible to recover $$\Pi_{w,v}$$ and $$\Pi_{w,u}$$. If we take some permutation $$\pi$$ and replace $$\Pi_{w,v}$$ with $$\Pi_{w,v} \circ \pi$$ and $$\Pi_{w,u}$$ with $$\Pi_{w,u} \circ \pi$$ then these new permutations will be equally good for the purposes of our statistical test above, and in fact if we make this change for a whole level we can see that the distribution of the generated data is identical, so the permutations $$\Pi$$ are underdetermined. A technical argument can be used to show that this is the only degree of freedom for the $$\Pi$$ and so we can recover up to this degree of freedom.

Recursive Algorithm: Now that we can learn siblings, we can recurse similarly to the case of tree reconstruction. Thus we can recover the tree and under the condition of a good labeling we can perform semi-supervised learning with $$k = \Theta(\ell)$$.

We end by noting that if $$S$$ is very large then there are other ways to learn labels without reconstructing the tree; e.g. if $$S$$ consists of all but a few labels, then we can learn labels of unlabeled leaves just by finding their nearest-neighbor among the elements of $$S$$.

# Lower Bounds

Note that in the above result we used a recursive (“deep”) algorithm. This leads us to this natural question: can we show that simpler algorithms don’t perform as well?

## Local Algorithms

First we will show that local algorithms require a suboptimal number of samples. The definition of a local algorithm is that when it labels $$w \notin S$$, it uses only $$\{(\sigma(v), L(v)) : v \in S\}$$ and $$\sigma(w)$$; it does not use any of the other unlabelled data.

We choose the following input distribution over labels: Fix an integer $$h$$ and $$h_0 < h$$. There are no labels for the first $$h_0 - 1$$ levels. At level $$h_0$$, we label each vertex distinctly, in a random order. For each vertex $$v$$ at level $$h_0$$, we pick its leftmost and rightmost descendant (i.e. the leftmost and rightmost leaves of the subtree under $$v$$) and add these two vertices to $$S$$.

Claim: The probability that a local algorithm labels a random leaf correctly is at most $$d^{-h_0}(1 + O(k \theta^{h - h_0}))$$.
Note that it is trivial to achieve $$d^{-h_0}$$ by guessing randomly. As a consequence, to do better than random even when $$h_0 = 1$$ we need $$k \theta^{h - 1} = \Omega(1)$$, so we need $$k = \Omega(\theta^{-(h - 1)}) = \Omega(|T|^{\beta})$$ for some constant $$\beta$$ depending on $$\theta$$. In comparison the “deep” algorithm we described before needs only $$k = \Theta(h) = \Theta(\log |T|)$$.
Proof [Sketch]: The little cheat we make is that if $$w$$ is in the same subtree as labeled leaves $$v_1, v_2$$ then we assume $$MCA(w,v_1) = MCA(w, v_2) = MCA(v_1, v_2)$$. (It is in fact true that most $$v_1,v_2$$ are far apart, which is a general and easy to prove fact about trees that we can use to make the following argument rigorous.)

Now the proof “trick” is to give the algorithm extra information, $$(Label(v), \sigma(v))$$ for each vertex $$v$$ at level $$h_0$$. Given this extra information, $$\sigma(w)$$ is independent of $$\{\sigma(v) : v \in S\}$$ by the Markov property (since removing the vertices at level $$h_0$$ leaves $$w$$ in a connected component by itself, relying on the “cheat” described above). We also make the algorithm’s life “easy” by setting $$\Pi = Id$$ and $$\Sigma = Id$$.

Now we have reduced the problem to a pure hypothesis testing question. Let $$H_i$$ be the hypothesis that $$w$$ is labelled by $$i$$, i.e. $$w$$ is a child of $$v_i$$ where $$v_i$$ is the vertex at level $$h_0$$ labeled by $$i$$. Note that the distribution of $$\sigma(w)$$ given the true $$v_i$$ is that of a noisy copy with parameter $$\theta^{h - h_0}$$. Thus with probability at least $$1 - k \theta^{h - h_0}$$ all of the coordinates of $$\sigma(w)$$ have been resampled from the uniform distribution. By a coupling argument, it follows that any hypothesis test labels correctly with probability at most $(1 - k\theta^{h - h_0}) (1/d) + k \theta^{h - h_0} \le (1/d) + k\theta^{h - h_0}.$

## Shallow Algorithms

Next we want to show that certain shallow algorithms also cannot recover as well as the “deep” algorithm we presented first. Formally, we define a shallow algorithm to be one that is given access only to the following data:

1. For each label $$a$$, each position $$i$$, and each value $$x \in [q]$$, the count $c_i(a,x) = \#\{\text{leaves } v \in S \text{ labelled by } a \text{ such that } \sigma(v)_i = x\}$
2. all unlabeled data

In other words, we have summarized the labeled data by count statistics and do not let the algorithm look at higher order moments of the labelled data.

We consider the family of input instances where the direct children of the root are labeled $$1, \ldots, d$$ and in each of their subtrees, every leaf is labeled except the rightmost child. Then we have the following theorem:

If $$d \theta^2 < 1$$, a shallow algorithm requires $$k \ge (1/\theta)^{c\ell} \ge \Omega(|T|^{c'})$$ samples in order to label correctly with probability greater than $$2/d$$. This is true even if $$\Sigma = Id$$ and $$\Pi = Id$$.

We do not present the whole proof of the theorem but just observe that it follows from known lower bounds on count reconstruction.

Finally we present an open problem: the lower bounds should be much better when the $$\Pi$$ are random and the $$\Sigma_j$$ mix different elements. More specifically, the bound for shallow algorithms should hold if $$\theta = 1 - c^h$$ for some $$c < 1$$, instead of requiring $$\theta < 1/\sqrt{d}$$, when the $$\Pi$$ are random and $$\Sigma$$ “mixes” well (e.g. $$\Sigma_r(x) = x + 2^r$$).


## Mathematics of Deep Learning: Lecture 7 – Recovering Tree Models

Transcribed by Paxton Turner (edited by Asad Lodhia, Elchanan Mossel and Matthew Brennan)

# Tree Reconstruction II

We prove Claim 2 which was left over from last time.

Fix a constant $$\ell \in \mathbb{N}$$ and $$\epsilon > 0$$. Given a $$d$$-ary tree with percolation parameter $$\lambda < 1$$ sufficiently large, then $\mathbb{P}_{\lambda}\left[\text{root is in an} \, \ell\text{-diluted} \, (d^{\ell} - 1)\text{-ary open tree}\right] \geq 1 - \epsilon$
Let
$f(p) := \mathbb{P}[\text{Bin}(d^\ell, p) \geq d^{\ell} - 1],$
and note that $$f$$ is monotone in $$p$$. We see that $$f(1) = 1$$, and $$f(p) = p^{d^\ell} + d^{\ell} (1-p) p^{d^{\ell} - 1}$$. Also
$f'(1) = d^\ell + d^\ell(d^\ell - 1 - d^\ell) = 0.$
Thus taking $$p^*$$ sufficiently close to 1 yields that $$f(p) > p$$ for all $$p^* < p < 1$$. Also choose $$p^*$$ to satisfy $$p^* > 1 - \epsilon$$. Now choose $$\lambda < 1$$ such that
$q = \mathbb{P}[\text{root is connected to all} \, d^\ell \, \text{nodes in the} \, \ell\text{-th level}] \geq p^*/f(p^*).$
Now we apply a recursive argument. Let
$p_r = \mathbb{P}[\text{root is in an} \, \ell\text{-diluted} \, (d^\ell - 1)\text{-ary open tree for} \, r \cdot \ell \, \text{levels}].$
We will show by induction that $$p_r \geq p^*$$ for all $$r$$, which implies the lemma. Note that $$p_0 = 1$$ and
$p_{r+1} \geq q \cdot \mathbb{P}\left[\text{Bin}(d^\ell, p_r) \geq d^{\ell} - 1\right] \geq \frac{p^* f(p_r)}{f(p^*)} \geq p^*$
since $$f$$ is monotone and $$p_r \ge p^*$$ by the induction hypothesis. The lemma is proved.
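The key inequality $$f(p) > p$$ near $$p = 1$$ can be checked numerically; this is a small sanity check with illustrative parameters, not part of the proof.

```python
def f(p, d, ell):
    """f(p) = P[Bin(d^ell, p) >= d^ell - 1] = p^n + n (1 - p) p^(n - 1),
    with n = d^ell: either all n trials succeed, or exactly one fails."""
    n = d ** ell
    return p ** n + n * (1 - p) * p ** (n - 1)

d, ell = 2, 2   # n = d^ell = 4, small enough to eyeball
# Since f(1) = 1 and f'(1) = 0, f(p) > p for p slightly below 1:
vals = [(p, f(p, d, ell)) for p in (0.95, 0.97, 0.99)]
```

For instance $$f(0.95) \approx 0.986 > 0.95$$, matching the fixed-point picture used in the induction.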

# Recovering the Tree Structure

We begin with an $$\ell$$-level full binary tree and take the size of the alphabet to be $$q = \infty$$. We generate $$k$$ independent samples of the broadcasting procedure with parameter $$\lambda$$ on the tree.

In contrast to the last lecture, today we do not know the tree structure, i.e. we only receive the colors of the nodes and do not know which nodes are siblings, cousins, etc. Our goal is to determine how large $$k$$ must be to recover the tree structure.

In practice, these trees are phylogenetic trees, and the information we have are the DNA/amino acid sequences. Thus, usually $$q = 4$$, or $$q = 20$$.

Question : How large should $$k, \ell, \lambda$$ be so that we recover the tree correctly as $$\ell \to \infty$$?

## Recovering Distances Using Correlations

If $$u,v$$ are graph distance $$2r$$ apart, then
$\mathbb{P}[\sigma(u) = \sigma(v)] = \lambda^{2r}.$
How large should $$k$$ be so that all distances can be computed accurately? Trivially, if $$k \lambda^{2\ell - 2} = 1/2$$ and $$d(u,v) = 2 \ell - 2$$, then taking a union bound gives
$\mathbb{P}[\exists \, \text{sample with} \, \sigma(u) = \sigma(v)] \leq 1/2.$
Then we can’t say if $$d(u,v) = 2 \ell$$ or $$d(u,v) = 2 \ell - 2$$. If we use concentration, then
$k \geq \log(2^\ell) \left(\frac{1}{\lambda^\ell}\right)^2 \log(1/\delta)$
allows us to compute distances correctly with probability $$\geq 1 - \delta$$. One can show this with Chernoff bounds, and the details are omitted here. If we compute all of the distances between the leaves correctly, then we trivially can recover the tree structure. However, as noted above, this requires at least $$k \geq c (1/\lambda)^{2 \ell} = |T|^\beta$$ samples.
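Distance estimation from collision frequencies can be sketched as follows (illustrative parameters; with $$q = \infty$$, fresh symbols never collide, so two vertices agree exactly when every edge on the path between them copied):

```python
import math
import random

rng = random.Random(0)
lam, D, k = 0.8, 4, 20000   # true path length D between u and v

def colors_agree():
    """q = infinity: sigma(u) = sigma(v) iff all D edges on the u-v path
    copied their parent's symbol (a fresh symbol breaks the agreement)."""
    return all(rng.random() < lam for _ in range(D))

p_hat = sum(colors_agree() for _ in range(k)) / k
# P[sigma(u) = sigma(v)] = lam^D, so the distance can be read off:
D_hat = math.log(p_hat) / math.log(lam)
```

The number of samples needed for `p_hat` to resolve adjacent distances is exactly the $$(1/\lambda)^{2\ell}$$ cost discussed above.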

## Recursive “Deep” Algorithms for $$q = \infty$$

Computing all of the distances between the leaves seems like too strong of a requirement. This raises the question: Is there a deep algorithm that tries to infer the labels at the internal nodes sample by sample that can recover the tree with fewer samples? We will outline such an algorithm when $$2 \lambda > 1$$.

After detecting siblings, we want to understand the probability of being able to recover colors at their parents sample by sample. Consider first a small example. Suppose we have a 2-level binary tree with leaves labeled. If there is a 1 in both the left and right branches, then with probability 1, the root is colored 1. However, if no color appears in both branches, we mark the node with a question mark $$?$$. Returning to our original problem, we can first apply this procedure to identify siblings among the leaves and the labels of their parents sample by sample, and then apply this to the nodes at level $$\ell – 1$$, and so on.
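The parent-recovery rule for $$q = \infty$$ can be sketched directly; symbols are strings here and `"?"` marks an unrecoverable node (names are ours).

```python
def parent_color(left_symbols, right_symbols):
    """q = infinity rule: a symbol seen in both the left and the right
    subtree must be the parent's color (fresh symbols are unique); if no
    symbol survives on both sides, mark the parent '?'."""
    common = (set(left_symbols) - {"?"}) & (set(right_symbols) - {"?"})
    return common.pop() if common else "?"

root = parent_color(["a", "x"], ["a", "y"])      # "a" survives on both sides
unknown = parent_color(["x", "z"], ["y", "w"])   # nothing in common
```

In the model at most one symbol can appear on both sides, so the `pop()` is unambiguous; applying this level by level gives the recursive reconstruction described above.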

If $$2 \lambda > 1$$, then there exists $$C(\lambda) > 0$$ such that
$\mathbb{P}[? \, \text{at level} \, r] \leq 1 - C(\lambda).$
Note that an internal node is not a ? if its label survives in its left and right subtrees. Therefore
\begin{align*}
\mathbb{P}[\text{a node at level } r \, \text{is not} \, ?] &\geq \mathbb{P}[\text{root connected to 2 children}] \\
&= \mathbb{P}[\text{branching process survives}]^2 \\
&= C(\lambda) > 0
\end{align*}
if $$2 \lambda > 1$$.

How many samples are needed to identify all siblings correctly?
$k= \frac{(1 - \lambda)^4}{\lambda} \log \left( \frac{2^\ell}{\delta}\right)$
suffices for error $$\leq \delta$$. It turns out that there is a similar bound for recovering the tree if $$2 \lambda > 1$$. As observed above, if we purely used distance methods, then we require $$k \geq c (1/\lambda)^{2 \ell} = |T|^\beta$$. However, if $$2 \lambda > 1$$, then our “deep” algorithm requires only $$k = O(\log |T|)$$ samples.

If $$2 \lambda < 1$$, then what is true? The goal is to show that the data generated on different trees cannot be distinguished. Heuristically, we must estimate the probability of getting all question marks $$?$$ for the lower bound.

Lower bound proof idea: Start with four trees having roots $$a, b, c,$$ and $$d$$. We want to couple the broadcasting procedure on these trees so that they produce the same data. In such a coupling, we can show
$\mathbb{P}[\text{generating different samples}] \leq 4(2 \lambda)^\ell.$
Thus,
$\mathbb{P}[\text{coupling succeeds}] \geq 1 – 4k(2 \lambda)^\ell.$
Thus if we want to recover the tree correctly with probability $$\geq 1 - \delta$$, we need $$k \geq c(2 \lambda)^{-\ell} = \Omega(|T|^{\beta})$$ samples (if $$2 \lambda < 1$$).

If $$2 \lambda < 1$$, distance methods do not give optimal sample complexity.

We make the following general conjecture.

Conjecture : Recursive methods should give optimal complexity.

## A Recursive Algorithm for $$q = 2$$

Next we consider a new model. Now we have a full binary tree, and on all edges, we copy with probability $$\eta$$. Our alphabet now has size $$q = 2$$, and in particular, we set the “letters” to be $$\{-1, 1\}$$.

Using distance methods, we cannot distinguish between
$\mathbb{E}[\sigma(u) \sigma(v)] = \eta^{2 \ell}, \quad \mathbb{E}[\sigma(v') \sigma(v)] = \eta^{2 \ell - 2}$

The lower bound for distance methods here is $$k \geq c/\eta^{2 \ell}$$, and this is more or less tight. As before, we will instead use a recursive approach: first we identify siblings, and then estimate colors of the parents. To identify siblings, use the correlations
$\mathbb{E}[\sigma(v) \sigma(v')] = \eta^2$
if $$v$$ and $$v'$$ are siblings. We use correlations to recover distances up to a constant number of levels above, and this requires $$\approx \log(2^\ell)$$ samples to concentrate. Then we can use majority vote to recover the color of the parent. By a similar computation to the 2nd moment method from last lecture, if $$2 \eta^2 > 1$$,
$\mathbb{P}[\text{Maj}(x_1, \ldots, x_{2^r}) = x_{\text{root}}] \geq \frac{1}{2} + \epsilon$
for all $$r$$. Thus we define
$\hat{\sigma}_i = \text{Maj}\left(\sigma_i(u): u \, \text{leaf below} \, v \right).$
Note that
$\mathbb{E}[\sigma_{\text{root}} \hat{\sigma}_{\text{root}}] = \rho > 0 \Rightarrow \mathbb{E}[\hat{\sigma}_i(v) \hat{\sigma}_i(w)] = \eta^{2d(v,w)} \rho^2.$
Also note that $$\rho = \Theta(\epsilon)$$.
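A simulation sketch of the majority estimator for $$q = 2$$ (illustrative parameters with $$2\eta^2 > 1$$; `rho_hat` estimates the correlation between the root sign and the majority of the leaves):

```python
import random

rng = random.Random(0)
eta, depth, trials = 0.9, 6, 500   # 2 * eta^2 = 1.62 > 1

def sample_leaves(root):
    """Broadcast root in {-1, +1} down a full binary tree: each child
    copies its parent w.p. eta, else takes a fresh uniform sign."""
    level = [root]
    for _ in range(depth):
        level = [c if rng.random() < eta else rng.choice((-1, 1))
                 for c in level for _ in range(2)]
    return level

def maj(xs):
    s = sum(xs)
    return 1 if s > 0 else (-1 if s < 0 else rng.choice((-1, 1)))

rho_hat = sum(
    root * maj(sample_leaves(root))
    for root in (rng.choice((-1, 1)) for _ in range(trials))
) / trials
```

With these parameters `rho_hat` comes out well above zero, consistent with majority succeeding in the $$2\eta^2 > 1$$ regime.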

Results: If $$2 \eta^2 > 1$$, we can recover the tree with $$k = \log(|T|)$$ samples. Otherwise, $$k \geq |T|^\beta$$. Compare this to the thresholds $$d \eta^2 > 1$$ for count reconstruction and $$d \lambda_2^2 > 1$$ for general broadcasting.

## Four point method for detecting distances

We define a new twist on the previous model. The parameter $$\eta$$ is now non-uniform, but we still require it to satisfy $$2 \eta^2(e) \geq 1 + \epsilon$$ for all edges $$e$$. In this general set-up, one has to be more careful with detecting distances/siblings.

Four point method: This is a trick from phylogeny. For the purposes of detecting distances, there are 3 possible 2-level binary trees: the neighbors are either $$\{(A, B), (C,D)\}$$, $$\{(A, C), (B, D) \}$$, or $$\{ (A, D), (B, C) \}$$. We use pairwise distances to determine which tree it is. Set
$Dist(u,v) = \log\left( \frac{1}{\mathbb{E}[\sigma_u \sigma_v]}\right).$

The correct pairing of $$A, B, C$$ and $$D$$ is the one for which
$Dist(X,Y) + Dist(W,V)$
is minimal, where $$(X, Y, W, V)$$ is some permutation of $$(A, B, C, D)$$.

Now to determine siblings of $$v$$, find all vertices with
$|Dist(u,v)| \leq 3 \log(1/\sqrt{2}),$
and apply the four point method on these vertices.
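The four point method can be sketched as follows; the distance values below are hypothetical numbers consistent with the tree $$((A,B),(C,D))$$ under unit edge lengths, not data from the lecture.

```python
def four_point(dist):
    """dist maps pairs of the 4 taxa to Dist(x, y).  Return the pairing
    {(X, Y), (W, V)} minimizing Dist(X, Y) + Dist(W, V)."""
    taxa = sorted({t for pair in dist for t in pair})
    def d(x, y):
        return dist[(x, y)] if (x, y) in dist else dist[(y, x)]
    # The 3 ways to split 4 taxa into two pairs, each anchored at taxa[0]:
    pairings = [((taxa[0], p), tuple(t for t in taxa[1:] if t != p))
                for p in taxa[1:]]
    return min(pairings, key=lambda pr: d(*pr[0]) + d(*pr[1]))

# Hypothetical distances for the tree ((A,B),(C,D)) with unit edges:
dist = {("A", "B"): 2, ("C", "D"): 2, ("A", "C"): 4,
        ("A", "D"): 4, ("B", "C"): 4, ("B", "D"): 4}
best = four_point(dist)
```

Here the pairing $$\{(A,B),(C,D)\}$$ has total distance 4 while the other two pairings have total 8, so the minimum picks out the correct topology.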

One can also run into an issue when estimating the colors of parents. To solve this, use an $$\ell$$ level recursive estimator.

If $$2 \eta^2 > 1$$, then there exists $$\ell(\eta)$$ such that using an $$\ell$$-level recursive majority gives estimator $$\hat{\sigma}$$ such that
$\mathbb{E}[\sigma \hat{\sigma}] \geq 1/2 + \epsilon.$

In the next lecture, we discuss hierarchical generative models.


## Mathematics of Deep Learning: Lecture 6 – Simple hierarchical models

Transcribed by Govind Ramnarayan (edited by Mathew Brennan, Asad Lodhia and Elchanan Mossel)

# Some Very Simple (Gaussian) Hierarchical Models

We start by describing some very simple hierarchical models on the $$d$$-ary tree . Recall that in the $$d$$-ary tree, each node in the tree has exactly $$d$$ children (As opposed to the $$d$$-regular tree, in which the root has $$d$$ children, and every other node has $$d-1$$ children). Both models will feature an unknown character at the root, chosen according to some distribution. Each node propagates its character to its children with some Gaussian noise. The goal will be to recover the character at the root given the characters at some level $$\ell$$.

## Version 1 (Brownian Motion)

We will denote the character at the vertex $$v$$ by $$X_v$$, with the root being denoted as 0.

$$X_0$$ will be distributed according to an unknown distribution $$\mu$$, with
$X_v = X_0 + \sum_{w \in \text{path}(0,v) \backslash \{0\}} \sigma \mathcal{N}_w$
where $$\mathcal{N}_w$$ are i.i.d. $$N(0,1)$$ random variables, and $$\sigma$$ is a fixed parameter.

The goal is to estimate $$X_0$$ given $$\left\{ X_{v}: v \in L_{\ell}\right\}$$, the set of labels at level $$\ell$$. However, this model will not be the one we focus on today.

## Version 2 (Ornstein-Uhlenbeck)

This is similar to the above model, but instead $$X_0 \sim N(0,1)$$ and if $$v$$ is a parent of $$w$$ then
$X_w = \eta X_v + \sqrt{1 - \eta^2} \mathcal{N}_w$
where $$\mathcal{N}_w$$ are i.i.d $$N(0,1)$$.

An important observation about this model is that $$X_w \sim N(0,1)$$ for all nodes $$w$$.

Just like in the previous model, we want to estimate $$X_0$$ given $$\left\{X_v: v \in L_{\ell}\right\}$$. The question is, how should we do it? A natural attempt is to take the average of the children, which we now analyze.

Note that $$(X_0, X_v: v \in L_\ell)$$ is jointly Gaussian. This means that the estimate that minimizes the $$\ell_2$$ error
$\mathbb{E}\left[X_0 \Big|(X_v)_{v \in L_{\ell}}\right]$
can be written as $$\sum_{v \in L_{\ell}} a_v X_v$$, by the definition of being jointly Gaussian. Also, by symmetry of the model, all the $$a_v$$’s are the same, so we can write this as $$a \sum_{v \in L_{\ell}} X_v$$.

So to find the correct estimate for $$X_0$$, it suffices to find the value of $$a$$ that minimizes
$\mathbb{E}\left[\left(X_0 - a \sum X_v\right)^2\right]$
which we now proceed to do. We begin by computing the mean squared error
\begin{align*}
\mathbb{E}\left[\left(X_0 - a \sum X_v\right)^2\right] &= \mathbb{E}[X_0^2] + a^2\sum \mathbb{E}[X_v^2] - 2ad^{\ell}\text{Cov}(X_0, X_v) + a^2 \sum_{v \neq w \in L_{\ell}} \text{Cov}(X_v, X_w) \\
&= 1 + d^{\ell}a^2 - 2ad^{\ell}\eta^{\ell} + a^2 d^{\ell} \sum\limits_{j=1}^{\ell} \eta^{2j} (d-1)d^{j-1} \\
&= 1 + d^{\ell}\left(-2a\eta^{\ell} + \frac{a^2(d-1)}{d} \cdot
\frac{(d \eta^2)^{\ell + 1} - 1}{(d\eta^2) - 1} \right)
\end{align*}
The second line follows from expanding out the sum and the fact that there are $$d^{\ell}$$ nodes at level $$\ell$$. The third line follows from the fact that the covariance of $$X_v$$ and $$X_w$$ is dependent on the distance between them in the underlying tree: namely, if the distance between them is $$D$$, then $$\mathbb{E}[X_v X_w] = \eta^D$$. Since all the vertices in the final sum are at the level $$\ell$$, they are at an even distance on the tree (the distance is $$2j$$, where $$j$$ denotes the distance between the vertices and their lowest common ancestor on the tree). The fourth line follows from the formula for computing the sum of a geometric series.

We can compute the value of $$a$$ that minimizes
\begin{equation}
1 + d^{\ell} \left(-2a\eta^{\ell} + \frac{a^2(d-1)}{d} \cdot \frac{(d \eta^2)^{\ell + 1} - 1}{(d\eta^2) - 1} \right)
\end{equation}
by taking the derivative with respect to $$a$$ (this can also be accomplished approximately by setting the two terms in the sum to be equal and solving for $$a$$). Doing so, we get that
\begin{equation}
a = \eta^{\ell} \left(\frac{d-1}{d} \cdot \frac{(d\eta^2)^{\ell + 1} - 1}{d\eta^2 - 1} \right)^{-1}
\end{equation}
Now we realize that the optimal squared error $$\mathbb{E}\left[\left(X_0 - a \sum X_v\right)^2\right]$$ at this value of $$a$$, as given by the computation above, behaves very differently when $$d \eta^2 < 1$$ versus when $$d \eta^2 > 1$$. When $$d \eta^2 < 1$$, we notice that $$\frac{(d \eta^2)^{\ell + 1} - 1}{(d\eta^2) - 1}$$ is approximately a constant. Furthermore, the optimal value of $$a$$ is roughly $$\eta^{\ell}$$, and so our expression for the optimal squared error becomes
$\mathbb{E}\left[\left(X_0 - a \sum X_v\right)^2\right] \approx 1 + d^{\ell}a(-2\eta^{\ell} + a \Theta(1)) \approx 1 - \Theta(d^{\ell}\eta^{2\ell}) \approx 1$
since $$d^\ell \eta^{2\ell} = (d\eta^2)^\ell \to 0$$. Note also that a squared error of 1 would be accomplished by simply ignoring the leaves and estimating $$X_0$$ by 0.

By contrast, when $$d\eta^2 > 1$$, we argue that the optimal squared error is less than 1. Plugging in the minimizing value of $$a$$ gives that
\begin{align*}
\mathbb{E}\left[\left(X_0 - a \sum X_v\right)^2\right] &= 1 + d^{\ell} \left(-2a\eta^{\ell} + \frac{a^2(d-1)}{d} \cdot \frac{(d \eta^2)^{\ell + 1} - 1}{(d\eta^2) - 1} \right) \\
&= 1 + d^{\ell}(-2a\eta^{\ell} + a\eta^{\ell}) \\
&= 1 - d^{\ell}\eta^{\ell} \left( \eta^{\ell} \frac{d}{d-1} \frac{d\eta^2 - 1}{(d\eta^2)^{\ell + 1} - 1} \right) \\
&\leq 1 - \frac{(d\eta^2)^{\ell}}{(d\eta^2)^{\ell + 1}} \cdot \frac{d}{d-1} \cdot (d\eta^2 - 1) \\
&< 1
\end{align*}

In summary, we have a threshold at $$d\eta^2 = 1$$ for this simple model. If $$d\eta^2 > 1$$, we have shown that we can estimate better than random with the optimal linear estimator. When $$d\eta^2 < 1$$, we know that we cannot do better than random. But this model was a linear Gaussian model, which is not at all similar to the type of structure posed by deep learning (specifically, deep learning has non-linearities applied at each layer, whereas the entire model here is linear). We’ll introduce some non-linearities in the next model to try to make it more interesting, and analyze the resulting behaviour.

# Model 3: “Threshold” Version of Model 2

In this model we deal with the same $$d$$-ary tree as before, but the initial distribution of $$X_0$$ and the noise model we use will change.

Specifically, $$X_0$$ will be distributed uniformly in $$[q] = \{ 1,\ldots, q\}$$ and if $$w$$ is the parent of $$v$$, then
$X_v = \begin{cases} X_w & \text{w.p. } \eta \\ \text{Unif}[q] & \text{w.p. } 1 - \eta \end{cases}$

An alternative view of the model is that each edge in the $$d$$-ary tree is erased with probability $$1 - \eta$$. Then, the root of each component gets a random symbol from $$[q]$$, and within each component the symbol is copied. So the labels within a component are the same, and the labels across components are independent.

Our goal is as follows.

Goal: Estimate $$X_0$$ from $$\left(X_v: v \in L_{\ell}\right)$$

Before continuing, we note that the error of any estimator is always at least $$\Theta(1)$$ (where we are thinking of $$d$$ and $$\eta$$ as constants), since there is a chance that there is no copying at the first level! But this was also a problem for Model 2.

A First Attempt (Plurality): One naive estimator is to take the value $$a \in [q]$$ that is most common among the $$\left(X_v : v \in L_{\ell}\right)$$ (i.e., taking the plurality vote amongst the leaves at level $$\ell$$). This is one of the most natural attempts, but is this the optimal estimator? We now analyse (a variant of) this estimator.

We’ll analyse a variant of the plurality estimator. Let
$y_v = \begin{cases} 1 - 1/q & \text{ if } X_v = 1 \\ -1/q & \text{ if } X_v \neq 1 \end{cases}$
The estimator we will use is the following:

1. If $$S := \sum_{v \in L_\ell} y_v > 0$$, then output 1.
2. Otherwise, output a random character from $$\{ 2,\ldots, q\}$$.
Claim: The estimator described above outperforms a random estimator if and only if plurality does.

We won’t formally prove this claim now; it will follow trivially from the results in the rest of the lecture. However, intuitively $$S$$ synchronizes with the plurality estimator when the root symbol is 1, and is random otherwise. If plurality does well, it should do well when the root symbol is 1 by symmetry. It is clear that in this case, $$S$$ succeeds whenever plurality succeeds, so it outperforms the random estimator. If $$S$$ works correctly, it means that 1 appeared in more than a $$\frac{1}{q}$$ fraction of the leaves very frequently. The only way plurality could work poorly in this case is if the plurality element is not 1 on many of these occasions. However, by symmetry, every non-root element has lower probability of showing up at the leaves than the root element, so this will only happen if, for many of those times, 1 showed up in just a little more than a $$\frac{1}{q}$$ fraction of the leaves, but was beaten by a non-root symbol due to random deviation. However, intuitively, in this case 1 would not have showed up in more than a $$\frac{1}{q}$$ fraction so often, because random deviation would often make it go below $$\frac{1}{q}$$. We will indeed see that this is the case in the remainder of the lecture.

First we note a trivial claim about when we cannot hope to recover $$X_0$$ better than random asymptotically.

[Trivial Lower Bound] If $$d \eta < 1$$ no estimator can do better than random asymptotically.
If $$d \eta < 1$$, then the fraction of leaves that are connected to the root tends to zero. Therefore, no correlations between the leaves and the root will survive.

How do we analyse this estimator? We’ll use the following three facts:
$\mathbb{E}\left[S | X_0 = 1\right] = \mathbb{E}\left[\#\{v \in L_\ell : v \text{ connected to the root}\}\right] \cdot \left(1 - \frac{1}{q}\right) = (d \eta)^{\ell} \left(1 - \frac{1}{q}\right)$
where a leaf is "connected to the root" if every edge on its path to the root was a copy event. Also,
$\mathbb{E}\left[S | X_0 \sim U\right] = 0$
which follows from the fact that each $$y_v$$ is an unbiased random variable. Here $$U$$ denotes the uniform distribution on $$[q]$$. Also,
$\text{Var}(S) \approx \Theta(d^{\ell})$
which follows from a direct computation very similar to the calculation in the previous model, where we minimized the $$\ell_2$$-error.

From the above three observations, we can conclude that there is some $$\epsilon > 0$$ such that
$d \eta^2 > 1 \Rightarrow \frac{\mathbb{E}\left[S | X_0 = 1\right] - \mathbb{E}\left[S | X_0 \sim U\right]}{\sqrt{\text{Var}(S)}} > \epsilon$
and furthermore that
$d\eta^2 < 1 \Rightarrow \frac{\mathbb{E}\left[S | X_0 = 1\right] - \mathbb{E}\left[S | X_0 \sim U\right]}{\sqrt{\text{Var}(S)}} \to 0$

As we will see, in the case where $$d \eta^2 > 1$$, we will be able to recover the root better than random by using the estimator $$S$$.
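Combining the three facts, the signal-to-noise ratio scales as $$(d\eta)^{\ell} / d^{\ell/2} = (d\eta^2)^{\ell/2}$$. A quick numerical sanity check (our own illustration, ignoring the constant factors):

```python
def snr(d, eta, ell):
    """E[S | X_0 = 1] / sqrt(Var S) up to constants: (d*eta)^ell / d^(ell/2)."""
    return (d * eta) ** ell / d ** (ell / 2)

# d * eta^2 > 1: the ratio grows with depth; d * eta^2 < 1: it decays to 0.
```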

[Second Moment Argument]
If $$\frac{\mathbb{E}\left[S | X_0 = 1\right] - \mathbb{E}\left[S | X_0 \sim U\right]}{\sqrt{\text{Var}(S)}} > \epsilon$$, then we can recover $$X_0$$ using $$S$$ better than random.
Let $$\mu$$ denote the measure of $$S$$ given a random character at the root, and let $$\mu^+$$ denote the measure of $$S$$ given that the root is 1. Furthermore, define $$f$$ such that $$d\mu^+ = f d\mu$$, and of course by definition $$\mu = 1d\mu$$. Then
\begin{align*}
\left(\mathbb{E}_{\mu^+}[S] - \mathbb{E}_{\mu}[S]\right)^2 &= \left( \int S (d\mu^+ - d\mu) \right)^2 \\
&= \mathbb{E}_{\mu}\left[S(f-1)\right]^2 \\
&\leq \mathbb{E}_{\mu}\left[S^2\right] \cdot \mathbb{E}_{\mu}\left[(f-1)^2\right]
\end{align*}
by Cauchy-Schwarz. Furthermore, we can upper bound the expression $$\mathbb{E}_{\mu}\left[(f-1)^2\right]$$ by noting that $$\mathbb{E}_{\mu}\left[(f-1)^2\right] \leq d_{\text{TV}}(\mu, \mu^+)$$, which follows since $$(\mu(v) - \mu^+(v))^2 \leq \left|\mu(v) - \mu^+(v)\right|$$ for all $$v$$. Finally, we can use this to lower bound the total variation distance between $$\mu$$ and $$\mu^+$$,
$d_{\text{TV}}(\mu, \mu^+) \geq \frac{\left(\int S (d\mu -d\mu^+)\right)^2}{\mathbb{E}_{\mu}\left[S^2\right]} \geq \epsilon^2$
where in the final step we used our assumption.

In conclusion, this estimator performs very similarly to the scenario for the “boring” case of Model 2! But wait:

1. We didn’t prove that the estimator fails if $$d \eta^2 < 1$$.
2. Maybe there’s a better estimator. Maybe even one that works for $$d \eta > 1$$?

These questions are addressed by the following two theorems.

1. If $$q=2$$, then $$d\eta^2 = 1$$ is the threshold (and $$q=3,4$$ behave similarly).
2. If $$q \geq 5$$, then $$d\eta^2$$ is not the threshold (but we have no simple formula for the threshold!)

In the theorem below, we define count reconstruction estimators to be estimators that depend only on the number of occurrences of each of the $$q$$ colors among the symbols at the leaves of the tree.

Count reconstruction estimators will always fail for $$d \eta^2 < 1$$.

In particular, $$d\eta^2 = 1$$ is the threshold for all sane count reconstruction estimators, like the one we have already described.

In the interest of time, we won’t actually prove both of these theorems. We will sketch some easy versions of a “proof” of the first of these two theorems.

[“Proof” that $$d\eta^2$$ is not threshold for $$q \geq 5$$]

Baby version ($$q = \infty$$): We first handle this easier special case. When $$q = \infty$$, the labels in the model can be constructed as follows:

1. Color root randomly (with some arbitrary color).
2. Copy the color with probability $$\eta$$.
3. Otherwise, choose a completely new color that has not been seen previously and use it.

Now consider the estimator defined as follows: If two nodes $$u, v \in L_{\ell}$$ have the root as their lowest common ancestor and $$x_u = x_v$$, then return $$x_v$$, and otherwise return an arbitrary color.

1. Correctness: Any repeated color must have been given by a common ancestor in this model. For $$u$$ and $$v$$ as chosen, this ancestor must have been the root!
2. Probability of correctness: The probability that we get such leaves is at least the probability that two children of the root survive, and that the branching process survives for each of them.
$\mathbb{P}[\text{correct}] \geq \eta^2 \cdot \mathbb{P}[\text{branching process survives}]^2 \geq \epsilon$
where we used the fact that the branching process survives with probability bounded away from zero when $$d \eta > 1$$.
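The $$q = \infty$$ construction and estimator can be sketched in code; we mint "completely new" colors with a Python counter (names and encoding are ours):

```python
import itertools
import random

def broadcast_fresh(depth, d, eta, colors, root):
    """q = infinity: a non-copied child gets a brand-new color from `colors`."""
    level = [root]
    for _ in range(depth):
        level = [x if random.random() < eta else next(colors)
                 for x in level for _ in range(d)]
    return level

def estimate_root_infinite(subtree_leaves):
    """If a color repeats across two different root-subtrees, it must have come
    from their lowest common ancestor, the root; otherwise guess arbitrarily."""
    for i, A in enumerate(subtree_leaves):
        for B in subtree_leaves[i + 1:]:
            common = set(A) & set(B)
            if common:
                return common.pop()
    return "arbitrary"
```

Each element of `subtree_leaves` holds the leaf colors of one child subtree of the root, so a repeated color across two lists certifies the root's color.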

How do we extend this to large, finite $$q$$? First, we make a definition that will be relevant in our extension.

An $$r$$-diluted $$d$$-ary tree is a tree where every node at level $$\ell$$ must have at least $$d$$ descendants at level $$\ell + r$$ (if that level exists).

The extension to large, finite $$q$$ will result from the following two claims:

If $$d \eta > 1$$, there exist $$r, \epsilon > 0$$ such that with probability at least $$\epsilon$$, the root cluster contains an infinite $$r$$-diluted binary tree.
Given $$d, r, \epsilon$$, there exists a value $$\eta' < 1$$ such that, for the branching process with copy probability $$\eta'$$, the probability that the root cluster contains an infinite $$r$$-diluted $$(d^{r}-1)$$-ary tree is at least $$1 - \frac{\epsilon}{10q}$$.

We will delay proving these two claims for now. Instead, we proceed to give a good estimator for the case where $$q$$ is large and finite, given these claims.

The Estimator: Look for a monochromatic, $$r$$-diluted binary tree of color $$i$$. If it exists, return color $$i$$.

By the first claim above, if the root is colored with $$i$$, then the probability that such a tree exists with color $$i$$ is $$\geq \epsilon$$. Now suppose that the root is not colored with $$i$$. Then we argue that there is no $$r$$-diluted binary tree of color $$i$$ with very high probability. Let $$v$$ be a child of a node $$w$$ in the tree. First we note that
\begin{align*}
\Pr[x_v \neq i | x_w \neq i] &= \eta + (1 - \eta)\left(1 - \frac{1}{q}\right) \\
&= 1 - \frac{1}{q} + \frac{\eta}{q} \underset{q \to \infty}{\longrightarrow} 1
\end{align*}
By taking $$q$$ large enough such that $$1 - \frac{1}{q} + \frac{\eta}{q} \geq \eta'$$ where $$\eta'$$ is from the second claim above, we see that we cannot fit an $$r$$-diluted binary tree of color $$i$$ in the remainder of the tree (since there is no space for the necessary 2 leaves at any level!).

Now we’ll prove the second claim. Before we proceed, we note that this proof was given in detail at the beginning of the following lecture, so the proof here reflects the proof done in that lecture.

[Proof of Second Claim]
We proceed by induction on $$k$$, the number of iterations. Let $$p_k$$ denote the probability that the Claim holds for a tree of depth $$k \cdot r$$ (that is, that the root cluster contains an $$r$$-diluted $$(d^{r}-1)$$-ary tree of depth $$k \cdot r$$ after percolation has occurred for $$k \cdot r$$ levels). We will use induction to show that, if $$\eta'$$ is large enough, then there is some $$p^* \geq 1 - \epsilon$$ such that $$p_k \geq p^*$$ for all $$k$$.

Define the function $$f$$ as follows
$f(p) = \Pr[\text{Bin}(d^{r}, p) \geq d^{r} – 1]$

We can see that $$f(1) = 1$$, and furthermore that $$f$$ is monotone in $$p$$. Furthermore, we can compute $$f(p)$$ by simply counting.
$f(p) = p^{d^{r}} + d^{r}(1 – p)p^{d^{r} – 1}$
Now we compute the derivative of $$f$$:
$f'(p) = d^{r} p^{d^{r}-1} - d^{r} p^{d^{r}-1} + d^{r}(d^{r}-1)(1-p)p^{d^{r}-2}, \qquad \text{so } f'(1) = 0.$
From this, we can conclude that for all sufficiently large $$p^* < 1$$, we have that
$f(p^*) \geq p^*$
We choose a $$p^* > 1 - \epsilon$$ and $$\eta' < 1$$ such that
$q = \Pr[\text{connected to } d^{r} \text{ descendants at distance } r] \geq \frac{p^*}{f(p^*)}$
(this $$q$$ is a probability, not the alphabet size). Finally, we can proceed to show that $$p_k \geq p^*$$ for all $$k$$ by induction. We note that $$p_0 = 1$$, as the $$0$$-depth tree is just the root. Furthermore,
\begin{align*}
p_{k+1} &\geq q \cdot \Pr[\text{Bin}(d^{r}, p_k) \geq d^{r} - 1] \\
&\geq \frac{p^*}{f(p^*)} \cdot f(p_k) \\
&\geq \frac{p^*}{f(p^*)} f(p^*) = p^*
\end{align*}
where the final inequality follows from the induction hypothesis $$p_k \geq p^*$$ and the monotonicity of $$f$$ in $$p$$.
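A quick numerical check (our own) that $$f(p) = p^{d^r} + d^r(1-p)p^{d^r - 1}$$ satisfies $$f(1) = 1$$, has zero slope at $$p = 1$$, and satisfies $$f(p^*) \geq p^*$$ for $$p^*$$ close to 1:

```python
def f(p, D):
    """Pr[Bin(D, p) >= D - 1] = p^D + D * (1 - p) * p^(D - 1), with D = d^r."""
    return p ** D + D * (1 - p) * p ** (D - 1)
```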

For the rest of the lecture, we will sketch the proof of the theorem that count reconstruction estimators will always fail for $$d \eta^2 < 1$$ (the word sketch should be re-emphasized here). Informally, this says that if $$d \eta^2 < 1$$, then we cannot reconstruct the character at the root better than random simply by looking at the count statistics on the leaves.

[Kesten-Stigum Theorem] If $$d \eta^2 < 1$$, then the count vector $$(c_1^{\ell}, \ldots, c_q^{\ell})$$ satisfies a CLT that does not depend on the root, where $$c_i^{\ell}$$ denotes the number of leaves at level $$\ell$$ with color $$i$$.

As a concrete example, we can use this CLT for the counts to give a CLT for our earlier count estimator $$S$$. Let $$\psi(i) = \mathbf{1}_{\{i = 1\}} - \frac{1}{q}$$. Then $$y_v = \psi(x_v)$$, and we can apply the Kesten-Stigum Theorem to $$S_{\ell} = \sum_{v \in L_{\ell}} y_v$$.

We now give a high level sketch of why the Kesten-Stigum theorem works for this sum. The idea is to use the right martingale. We consider
exposing one node of the tree at a time, and subtracting the “expected value” of the node given its parent to keep the exposed variables zero mean. More formally, define
$S_{\ell}' = \sum_{v: |v| \leq \ell} (d \eta)^{-|v|}(y_v - \eta y_{\text{parent}(v)})$
where $$|v|$$ denotes the level of the vertex $$v$$ in the tree (with the convention $$y_{\text{parent}(\text{root})} = 0$$). Note that $$\mathbb{E}\left[y_v | y_{\text{parent}(v)}\right] = \eta \cdot y_{\text{parent}(v)}$$, so whenever we add a vertex to the sum, the expected value of the sum remains the same. By induction, it follows that the expectation is unchanged when we add an entire level to the sum. Hence, we can conclude that

1. $$S_{\ell}’$$ is a martingale.
2. $$S_{\ell}’ = \sum_{v: |v| = \ell} d^{- \ell} \eta^{- \ell} y_v$$

where the second item follows by induction and noting that the sum telescopes on adjacent levels. We know that $$S_0' = \sum_{v: |v| = 0} (d \eta)^{0} y_v$$, so the claim holds for $$\ell = 0$$. We now assume that $$S_{\ell}' = \sum_{v: |v| = \ell} d^{- \ell} \eta^{- \ell} y_v$$ and prove the statement for $$\ell + 1$$.
\begin{align*}
S_{\ell + 1}' &= \sum_{v: |v| \leq \ell+1} (d \eta)^{-|v|}(y_v - \eta y_{\text{parent}(v)}) \\
&= S_{\ell}' + \sum_{v: |v| = \ell + 1}(d \eta)^{- \ell - 1}y_v - \sum_{v: |v| = \ell + 1} (d \eta)^{-\ell - 1} \eta y_{\text{parent}(v)} \\
&= S_{\ell}' + \sum_{v: |v| = \ell + 1}(d \eta)^{- \ell - 1}y_v - \sum_{v: |v| = \ell} (d \eta)^{-\ell} y_v \\
&= \sum_{v: |v| = \ell + 1}(d \eta)^{- \ell - 1}y_v
\end{align*}
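The telescoping identity holds for arbitrary values $$y_v$$ on a complete $$d$$-ary tree, so it is easy to verify numerically (a sketch with our own tree encoding; nodes are tuples of child indices):

```python
import random

def telescoping_check(d, eta, depth):
    """Compare S' = sum over all v of (d*eta)^{-|v|} (y_v - eta * y_parent(v))
    with the closed form sum over leaves of (d*eta)^{-depth} * y_v."""
    levels = [[()]]                       # level 0: the root, encoded as ()
    for _ in range(depth):
        levels.append([v + (i,) for v in levels[-1] for i in range(d)])
    y = {v: random.random() for lvl in levels for v in lvl}
    martingale = sum((d * eta) ** (-len(v)) * (y[v] - (eta * y[v[:-1]] if v else 0))
                     for lvl in levels for v in lvl)
    closed_form = sum((d * eta) ** (-depth) * y[v] for v in levels[-1])
    return martingale, closed_form
```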
Now, we would like to appeal to a Martingale Central Limit Theorem to establish that $$S_{\ell}'$$ satisfies a CLT that is independent of the root. Note that Azuma’s inequality would tell us that $$\lim_{\ell \to \infty} S_{\ell}'$$ is highly concentrated around $$S_0'$$ if we knew that $$\sum\limits_{k=0}^{\infty} c_k^2 < \infty$$, where the $$c_k$$ denote the martingale differences; in that case, the leaves would not satisfy a CLT independent of the root. We therefore lower bound the sum of squared differences:
\begin{align*}
\sum_{\ell' \leq \ell}(S_{\ell'}' - S_{\ell' - 1}')^2
&\geq \sum_{\ell' \leq \ell} d^{\ell'}(d \eta)^{- 2 \ell'} \Theta(1) \\
&= \Theta(1) \sum_{\ell' \leq \ell} \left(\frac{1}{d\eta^2}\right)^{\ell'}
\end{align*}
If $$d \eta^2 < 1$$, this series diverges, which means that the count vector satisfies a CLT independent of the root. However, this is not quite enough; a CLT gives a statement of the form $$\mathbb{P}[\sum X_i \in [a - c\sqrt{n}, a + c\sqrt{n}]]$$, and just knowing that the counts are concentrated in an interval does not rule out an estimator that reconstructs the root from the parity of $$c_1^{\ell}$$ (or similar functions that are not constrained to an interval). It does not really make sense for any sane estimator to use the parity when trying to reconstruct the root, but we still have to rule it out! So we will additionally need a Local Central Limit Theorem, of the form

[Local CLT] Let $$X_i$$ be iid integer valued and not supported on an arithmetic progression with stride $$\geq 2$$. Then we have that
$\Pr\left[\sum\limits_{i=1}^n X_i = m\right] = \frac{1}{\sigma \sqrt{2 \pi n}} \exp\left(-\frac{(m - n\mu)^2}{2 \sigma^2 n}\right) + o(1/\sqrt{n})$
where $$\sigma^2$$ is the variance of $$X_i$$ and $$\mu$$ is the expectation.

The local CLT helps us rule out “parity-like” estimators that still only depend on the counts at the leaves.


## Mathematics of Deep Learning: Lecture 5 – Random Sparse Nets etc.

Transcribed by Jiaoyang Huang (edited by Asad Lodhia, Elchanan Mossel and Matthew Brennan)

For today’s lecture we will present two papers. The first one is Provable Bounds for Learning Some Deep Representations by Arora, Bhaskara, Ge and Ma. They proved that random sparse neural networks are learnable. The second paper is Train faster, generalize better: Stability of stochastic gradient descent by Hardt, Recht and Singer. Their main results state that any model trained with the stochastic gradient method in a reasonable amount of time attains small generalization error.

# Provable Bounds for Learning Some Deep Representations

The sparse random network model has $$\ell$$ layers of vectors of binary variables $$h^{(\ell)}, h^{(\ell-1)},\cdots, h^{(1)}$$ (where $$h^{(\ell)}$$ is the top layer) and an observed layer $$h^{(0)}$$ at bottom. Each layer has $$n$$ nodes.
The activation function is the sgn function, i.e.
\begin{align}
\sigma(x)=\left\{
\begin{array}{cc}
1 & x>0,\\
0 & x=0,\\
-1 & x<0.
\end{array}
\right.
\end{align}
The network between two layers is a random graph with average degree $$d=n^{\gamma}$$ with $$0<\gamma\leq 0.2$$. More precisely, each node at level $$k$$ is connected to each node at level $$k-1$$ with probability $$n^{\gamma-1}$$ independently. The edge weights are random in $$\{\pm 1\}$$ with probability $$1/2$$ each. For any vertex $$i$$ in the $$k$$-th layer, and vertex $$j$$ in the $$(k-1)$$-th layer, the weight $$W_{ji}^{k-1}\in \{-1,0,1\}$$ is such that
\begin{align}
W_{ji}^{k-1} =\left\{
\begin{array}{cc}
0 & \text{there is no edge between $$i,j$$,}\\
-1 & \text{there is an edge with weight $$-1$$ between $$i,j$$,}\\
1 & \text{there is an edge with weight $$1$$ between $$i,j$$.}
\end{array}
\right.
\end{align}
We remark that $$W^{k-1}$$ is the weighted adjacency matrix between the $$(k-1)$$-th and $$k$$-th layers. The samples are generated from top to bottom. Given $$h^{(k)}$$, the vector in the $$k$$-th layer, the vector in the $$(k-1)$$-th layer is $$h^{(k-1)}=\mathrm{sgn}(W^{k-1}h^{(k)})$$, i.e.
\begin{align}
h^{(k-1)}_j=\mathrm{sgn}\left(\sum_{i=1}^n W^{k-1}_{ji}h^{(k)}_i\right),\quad j=1,2,\cdots, n.
\end{align}
We sample the top layer $$h^{(\ell)}$$ by picking the set of nodes that are $$1$$ uniformly among all sets of size $$\rho_\ell n$$. Here $$\rho_\ell$$ is called the density of the top layer, and we assume that $$\rho_\ell (d/2)^\ell=O(1)$$. Since each node has $$d$$ children on average, the density at level $$k$$ is roughly of size $$\rho_\ell d^{\ell-k}\ll1$$, so $$h^{(k)}$$ is also a sparse vector. The main theorem we will prove is the following:

When degree $$d=n^{\gamma}$$, for $$0<\gamma\leq 0.2$$, with density $$\rho_\ell (d/2)^{\ell}=O(1)$$, the network model can be learned with high probability using $$O(\log n/\rho_\ell^2)$$ samples and $$O(n^2\ell)$$ time.
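The generative model above can be sketched in a few lines of NumPy (our own illustration; the parameter choices in the usage below are arbitrary):

```python
import numpy as np

def sample_weights(n, gamma, ell, rng):
    """One random sparse +/-1 matrix per layer; edge probability n^(gamma - 1)."""
    p = n ** (gamma - 1)
    return [(rng.random((n, n)) < p) * rng.choice([-1, 1], size=(n, n))
            for _ in range(ell)]

def generate_sample(Ws, h_top):
    """Propagate top to bottom: h^(k-1) = sgn(W^(k-1) h^(k))."""
    h = h_top
    for W in Ws:                      # Ws[0] maps layer ell to layer ell - 1
        h = np.sign(W @ h)
    return h
```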

Thanks to the sparsity, the net should approximately preserve information, allowing one to easily go back and forth between the representations in two adjacent layers. In the following we first focus on learning a single-layer network. It has a top layer with vector $$h^{(1)}$$ and a bottom layer with vector $$h^{(0)}$$. We assume that $$h^{(1)}$$ is sampled among sparse vectors, and then $$h^{(0)}$$ is generated using $$h^{(1)}$$. To learn a single-layer network, we first need to recover the network structure. Then for each sample, we need to recover $$h^{(1)}$$ from $$h^{(0)}$$.

Given a single-layer network (with known graph structure), the property that we can recover $$h^{(1)}$$ from $$h^{(0)}$$ is characterized by the following definition:

Fix an activation function. An autoencoder consists of a decoding function $$D(h)=s(Wh+b)$$ and an encoding function $$E(h)=s(W'h+b')$$ where $$W,W'$$ are linear transformations, $$b,b'$$ are fixed vectors and $$s$$ is an activation function, which acts on each coordinate. The autoencoder is said to be denoising if $$E(D(h)\oplus \xi)=h$$ with high probability with respect to the noise $$\xi$$ and the randomness of $$h$$, where $$D(h)\oplus\xi$$ means that $$D(h)$$ is corrupted by noise $$\xi$$. The autoencoder is called weight tying if $$W'=W^T$$.

We will show that a single-layer random network is a denoising autoencoder if the input layer and noise are both sparse.

A one layer network with density $$\rho$$ satisfying $$\rho d<0.1$$ is a denoising autoencoder with high probability, with respect to noise that flips every output bit independently with probability $$<0.1$$. Furthermore this autoencoder is weight tying.
We denote by $$W$$ the weighted adjacency matrix between the top layer and the bottom layer. By definition, the one layer network implements the encoder $$E(y)=\textrm{sgn}(W^Ty)$$. We will show that the one layer network is an autoencoder with decoder $$D(h)=\textrm{sgn}(Wh)$$ with high probability. Note that this will also imply that it is weight tying.

Let $$S$$ be the support of the top layer $$h^{(1)}$$, and $$S’$$ be the support of the bottom layer $$h^{(0)}$$. The key property for the sparse network is that most neighbors of an activated node are unique. More precisely, fix $$i$$ in the top layer, and $$j\in {\cal N}(i)$$ in the bottom layer, where $${\cal N}(i)$$ is the set of neighbor vertices of $$i$$, i.e. $$W_{ji}\neq 0$$. Thanks to the sparsity assumption, with high probability, $$j$$ has no non-zero neighbor other than $$i$$, i.e.
\begin{align}
{\mathbb P}(|{\cal N}(j)\cap S\setminus \{i\}|\geq 1)\leq \rho d<0.1.
\end{align}
If $${\cal N}(j)\cap S\setminus \{i\}=\emptyset$$ and $$h_j^{(0)}$$ was not flipped by the noise, then $$h_{i}^{(1)}=W_{ji}h_j^{(0)}$$. Moreover,
\begin{align}
{\mathbb P}({\cal N}(j)\cap S\setminus \{i\}\neq\emptyset {\text{ or $$h_j^{(0)}$$ was flipped by the noise}})< 0.2.
\end{align}
Motivated by these observations, we can recover $$h^{(1)}$$ using the following quantity,
\begin{align}
I_{i}=\sum_{j\in {\cal N}(i)}W_{ji}h_j^{(0)}.
\end{align}
If $$|I_i|\geq d/2$$, then we set $$h_i^{(1)}=\mathrm{sgn}(I_i)$$. Otherwise, if $$|I_i| < d/2$$, we set $$h_i^{(1)}=0$$.
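The thresholded decoding rule can be written compactly (a sketch of our own; in the noiseless, collision-free toy case below it recovers the top layer exactly):

```python
import numpy as np

def encode(W, h0, d):
    """Recover the top layer: I_i = sum_j W_ji h0_j, thresholded at d/2."""
    I = W.T @ h0
    return np.where(np.abs(I) >= d / 2, np.sign(I), 0)
```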

In the following we focus on learning the structure of a single-layer network. We recall that $$h^{(1)}$$ is sampled among sparse vectors (with density $$\rho$$), and then $$h^{(0)}$$ is generated using $$h^{(1)}$$. The graph structure can be recovered using $$\mathrm{poly}(n)$$ samples.

Given $$\mathrm{poly}(n)$$ samples, one can recover $$W$$, the weighted adjacency matrix (between layers $$1$$ and $$0$$).
We define a $$\sim$$ relation among vertices in the bottom layer: call two nodes $$i_1,i_2$$ at the bottom layer related, denoted by $$i_1\sim i_2$$, if they are connected to the same node at the top layer. We can identify related nodes by their correlations. If $$i_1\sim i_2$$, then they have common neighbors. With probability $$\rho$$, one of their common neighbors is non-zero, so
\begin{align}
\mathbb E\left[h_{i_1}^{(0)}h_{i_2}^{(0)}\right]\approx \rho.
\end{align}
If $$i_1\not\sim i_2$$, then they do not have common neighbors. With good probability, at least one of them is zero
\begin{align}
\mathbb E\left[h_{i_1}^{(0)}h_{i_2}^{(0)}\right]\approx 0.
\end{align}
Once we have recovered the $$\sim$$ relations, recovering the graph structure is straightforward. In fact we can name vertices in the top layer by largest clusters in the bottom layer. More precisely, each vertex in the top layer corresponds to a largest cluster $$A$$ in the bottom layer such that $$|A|\geq 0.1d$$ and for any $$i,j\in A$$, $$i\sim j$$. This approach can be strengthened to tolerate some errors by identifying clusters that differ in fewer than $$0.1d$$ nodes, i.e. we identify $$A$$ and $$A'$$ if $$|A\Delta A'|\leq 0.1 d$$.
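The correlation test for the $$\sim$$ relation can be sketched as follows (our own code; the threshold is illustrative and would be set between $$0$$ and $$\rho$$ in practice):

```python
import numpy as np

def related_pairs(H0, thresh):
    """H0: (num_samples x n) matrix of bottom-layer samples.
    Declare i ~ j when the empirical E[h_i h_j] exceeds `thresh`."""
    C = H0.T @ H0 / H0.shape[0]     # empirical correlation matrix
    n = C.shape[1]
    return {(i, j) for i in range(n) for j in range(i + 1, n) if C[i, j] > thresh}
```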

The high level network inference algorithm in the paper learns the network layer by layer starting from the bottom. For the network between the $$i$$-th and $$(i+1)$$-th layers, the algorithm first recovers the network structure between levels $$i$$ and $$i+1$$. Then it recovers $$h^{(i+1)}$$ from $$h^{(i)}$$ for each sample as above.

There is a technical issue. In the original model, we sample $$h^{(\ell)}$$ uniformly among all the sparse vectors, then sequentially generate $$h^{(\ell-1)}, h^{(\ell-2)}, \cdots, h^{(0)}$$. The bits of $$h^{(i+1)}$$ are not independent! We cannot directly use the Lemma above to recover the network structure between levels $$i$$ and $$i+1$$. However, this problem can be resolved by showing that the bits of $$h^{(i+1)}$$ cannot be too correlated unless they have a common ancestor. Moreover, the correlation decays exponentially, i.e.
\begin{align}
\text{grandparent correlation} \leq \frac{1}{d} \cdot \text{parent correlation.}
\end{align}

# Train Faster, Generalize Better: Stability of Stochastic Gradient Descent

Let’s consider the following general setting of supervised learning. There is an unknown distribution $$\cal D$$. We receive a sample $$S=(z_1,z_2,\cdots, z_n)$$ of $$n$$ i.i.d. examples drawn from $$\cal D$$. Let $$f(w, z)$$ denote a loss function, where $$w$$ specifies the model and $$z$$ denotes an example. Our goal is to minimize the population risk:
\begin{align}
R[w]:= {\mathbb E}_{z\sim \cal D}[f(w,z)],
\end{align}
over all possible models $$w$$. Since the distribution $$\cal D$$ is unknown, we cannot measure the quantity $$R[w]$$ directly; we instead minimize the empirical risk:
\begin{align}
R_S[w]:=\frac{1}{n}\sum_{i=1}^n f(w,z_i).
\end{align}
The generalization error of a model $$w$$ is the difference
\begin{align}
R_S[w]-R[w].
\end{align}
If $$w=A(S)$$ is chosen as a function of the data by a possibly random algorithm $$A$$, we need to estimate the expected generalization error
\begin{align}
\varepsilon_{gen}:=\mathbb E_{S,A}[R_S[A(S)]-R[A(S)]].
\end{align}

In order to bound the generalization error, we recall the following definition of uniform stability,

The algorithm $$A$$ is $$\varepsilon$$-uniformly stable if for all data sets $$S, S'$$ that differ in at most one example:
\begin{align}
\sup_{z}\mathbb E_{A}[f(A(S),z)-f(A(S’),z)]\leq \varepsilon.
\end{align}
We define $$\varepsilon_{Stab}(A,n)$$ to be the smallest such $$\varepsilon$$.
The generalization error can be controlled by $$\varepsilon_{Stab}(A,n)$$:
\begin{align}
|\mathbb E_{S,A}[R_S[A(S)]-R[A(S)]]|\leq \varepsilon_{Stab}(A,n).
\end{align}
Denote by $$S=(z_1,z_2,\cdots, z_n)$$ and $$S’=(z_1′,z_2′,\cdots,z_n’)$$ two independent random samples and let $$S^{(i)}=(z_1,\cdots, z_{i-1}, z_i’,z_{i+1},\cdots,z_n)$$. With these notations we have
\begin{align}\begin{split}
\mathbb E_{S}\mathbb E_{A}[R_S[A(S)]]
&=\mathbb E_{S}\mathbb E_{A}\left[\frac{1}{n}\sum_{i=1}^n f(A(S),z_i)\right]\\
&=\mathbb E_{S}\mathbb E_{S’}\mathbb E_{A}\left[\frac{1}{n}\sum_{i=1}^n f(A(S^{(i)}),z_i’)\right]\\
&=\mathbb E_{S}\mathbb E_{S’}\mathbb E_{A}\left[\frac{1}{n}\sum_{i=1}^n f(A(S),z_i’)\right]+\delta\\
&=\mathbb E_{S}\mathbb E_{A}\left[R[A(S)]\right]+\delta,
\end{split}\end{align}
where
\begin{align}
|\delta|&=\left|\mathbb E_{S}\mathbb E_{S’}\mathbb E_{A}\left[\frac{1}{n}\sum_{i=1}^n \left(f(A(S^{(i)}),z_i’)-f(A(S),z_i’)\right)\right]\right| \\
&\leq \varepsilon_{Stab}(A,n).
\end{align}
This finishes the proof.

In the following we show that for stochastic gradient iterative algorithms, we can control their uniform stability, and obtain some upper bound on the generalization error.

Given the loss function $$f(w,z)$$, the stochastic gradient update with learning rate $$\alpha_t>0$$ is given by
\begin{align}
w_{t+1}=w_t-\alpha_t \nabla_w f(w_t, z_{i_t})
\end{align}
where the indices $$i_t$$ are either picked uniformly randomly in $$\{1,2,\cdots,n\}$$, or picked according to a random permutation.
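The update rule is short enough to state directly in code (a sketch; `grad_f` is a user-supplied gradient, and we use the uniformly random index variant):

```python
import numpy as np

def sgd(grad_f, w0, data, alphas, rng):
    """Stochastic gradient method: w_{t+1} = w_t - alpha_t * grad f(w_t, z_{i_t})."""
    w = np.asarray(w0, dtype=float)
    for alpha in alphas:
        z = data[rng.integers(len(data))]   # i_t uniform in {0, ..., n-1}
        w = w - alpha * grad_f(w, z)
    return w
```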

In the rest of the lecture, we analyze the stochastic gradient method for convex loss minimization. Let us first discuss a little bit of convexity:

If $$f$$ is convex and $$\nabla f$$ is $$\beta$$-Lipschitz then
\begin{align}\label{e:coercive}
\langle \nabla f(v)-\nabla f(w), v-w\rangle \geq \frac{1}{\beta}\|\nabla f(v)-\nabla f(w)\|_2^2.
\end{align}
First we prove that
\begin{align}\label{e:sconvex1}
f(y)\geq f(x)+\nabla f(x)^T(y-x)+\frac{1}{2\beta}\|\nabla f(y)-\nabla f(x)\|^2_2.
\end{align}
Then by symmetry, we have
\begin{align}\label{e:sconvex2}
f(x)\geq f(y)+\nabla f(y)^T(x-y)+\frac{1}{2\beta}\|\nabla f(y)-\nabla f(x)\|^2_2.
\end{align}
Our claim \eqref{e:coercive} follows by taking sum of \eqref{e:sconvex1} and \eqref{e:sconvex2}.

For the proof of \eqref{e:sconvex1}, we introduce a new function
\begin{align}
g(z)=f(z)-\nabla f(x)^T z.
\end{align}
Then $$g(z)$$ is convex and $$\nabla g$$ is $$\beta$$-Lipschitz. Moreover, thanks to the convexity of $$f$$, i.e.
\begin{align}
f(z)\geq f(x)+\nabla f(x)^T(z-x),
\end{align}
$$g$$ achieves its minimum at $$z=x$$.

Since $$\nabla g$$ is $$\beta$$-Lipschitz, by the Taylor expansion, for any $$z$$,
\begin{align}
g(z)\leq g(y)+\nabla g(y)^T(z-y)+\frac{\beta}{2}\|z-y\|_2^2.
\end{align}
By taking the minimum on both sides,
\begin{equation}
\begin{split}
g(x)
&=\min_z\{ g(z)\}\\
&\leq \min_z\left\{g(y)+\nabla g(y)^T(z-y)+\frac{\beta}{2}\|z-y\|_2^2\right\}\\
&=g(y)-\frac{1}{2\beta}\|\nabla g(y)\|_2^2,
\end{split}
\end{equation}
which is \eqref{e:sconvex1}.

We prove the stability bound for convex loss minimization via stochastic gradient method.

Assume that for all $$z$$, $$f(\cdot, z)$$ is convex and $$L$$-Lipschitz, and $$\nabla f$$ is $$\beta$$-Lipschitz. Suppose that we run the stochastic gradient method for $$T$$ steps with step sizes $$\alpha_t\leq 2/\beta$$, then
\begin{align}
\varepsilon_{Stab}\leq \frac{2L^2}{n}\sum_{t=1}^T\alpha_t.
\end{align}
Let $$S$$ and $$S’$$ be two samples of size $$n$$ differing in only a single example. Let $$\{w_t\}$$ and $$\{w_t’\}$$ be two runs of the algorithm with data sets $$S$$ and $$S’$$, and $$\delta_t=\|w_t-w_t’\|_2$$.

We couple the two algorithms with data sets $$S$$ and $$S’$$. At step $$t$$, there are two cases: with probability $$1-1/n$$, the algorithms choose the same data example; with probability $$1/n$$, the algorithms choose different data examples.

If the two algorithms choose the same data example $$z$$ at step $$t$$, then
\begin{equation}
\begin{split}
\delta_{t+1}^2
=&\|w_t-w_t' -\alpha_t(\nabla f(w_t, z)-\nabla f(w'_t,z))\|^2_2\\
=&\|w_t-w_t'\|_2^2 +\alpha_t^2\|\nabla f(w_t, z)-\nabla f(w'_t,z)\|^2_2\\
&-2\alpha_t\langle w_t-w_t', \nabla f(w_t, z)-\nabla f(w'_t,z)\rangle\\
\leq& \|w_t-w_t'\|_2^2=\delta_t^2,
\end{split}
\end{equation}
provided that $$\alpha_t\leq 2/\beta$$, where we used \eqref{e:coercive}.

If the two algorithms choose different data examples $$z$$ and $$z’$$, then
\begin{equation}
\begin{split}
\delta_{t+1}&=\|w_t-w_t' -\alpha_t(\nabla f(w_t, z)-\nabla f(w'_t,z'))\|_2\\
&\leq \|w_t-w_t'\|_2+ \alpha_t\left(\|\nabla f(w_t, z)\|_2+\|\nabla f(w'_t,z')\|_2\right)\\
&\leq \delta_t+2\alpha_tL.
\end{split}
\end{equation}

The first case happens with probability $$1-1/n$$, and the second case happens with probability $$1/n$$. On average, we have
\begin{equation}
\begin{split}
\mathbb E[\delta_{t+1}]&\leq \left(1-\frac{1}{n}\right)\mathbb E[\delta_t]+
\frac{1}{n}\mathbb E[\delta_t+2\alpha_t L]\\
&=\mathbb E [\delta_t]+\frac{2\alpha_tL}{n}.
\end{split}
\end{equation}
Overall, we have
\begin{align}
\mathbb E[\delta_{T}]\leq\frac{2L}{n}\sum_{t=1}^T\alpha_t,
\end{align}
and
\begin{align}
\mathbb E\left|f(w_T,z)-f(w'_T,z)\right|\leq L\, \mathbb E\|w_T-w_T'\|_2\leq \frac{2L^2}{n}\sum_{t=1}^T\alpha_t.
\end{align}
This finishes the proof.
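The coupling in the proof can be observed empirically: run the same sequence of index choices on two datasets differing in one example and track $$|w_T - w_T'|$$ (our own sketch with a simple scalar quadratic loss; the quadratic is only Lipschitz on bounded domains, so this is purely illustrative):

```python
import numpy as np

def sgd_coupled(data, alphas, idxs, w0=0.0):
    """f(w, z) = 0.5 * (w - z)^2, so grad_w f = w - z."""
    w = w0
    for alpha, i in zip(alphas, idxs):
        w = w - alpha * (w - data[i])
    return w

rng = np.random.default_rng(0)
n, T, alpha = 50, 200, 0.05
data = rng.normal(size=n)
data_prime = data.copy(); data_prime[0] += 1.0   # datasets differ in one example
idxs = rng.integers(n, size=T)                   # shared (coupled) index choices
wT = sgd_coupled(data, [alpha] * T, idxs)
wT_prime = sgd_coupled(data_prime, [alpha] * T, idxs)
gap = abs(wT - wT_prime)     # stays small: grows only when index 0 is sampled
```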


## Mathematics of Deep Learning: Lecture 4 – PAC Learning and Deep Nets

Transcribed by Vishesh Jain (edited by Asad Lodhia and Elchanan Mossel)

# PAC Learning

We begin by discussing (some variants of) the PAC (Probably Approximately Correct) learning model introduced by Leslie Valiant. Throughout this section, we will deal with a hypothesis class or concept class, denoted by $$\mathcal{C}$$; this is a space of functions $$\mathcal{X}\rightarrow\mathcal{Y}$$, where $$\mathcal{X}=\{0,1\}^{n}$$ for some $$n\in\mathbb{N}$$ and $$\mathcal{Y}$$ is some finite set.

As a concrete example, we could consider formalizing the task of classifying whether a $$16\times16$$ bitmap depicts a cat, a dog, or neither, in which case we could take $$n$$ to be $$256$$, $$\mathcal{Y}$$ to be $$\{c,d,n\}$$ (where $$c$$ denotes “Cat”, $$d$$ denotes “Dog” and $$n$$ denotes “Neither”), and the concept class $$\mathcal{C}$$ to consist of a set of “classifiers”, i.e. a subset of functions $$\{0,1\}^{256}\rightarrow\{c,d,n\}$$.

The proper PAC learning task is the following: given $$\mathcal{X},\mathcal{Y},\mathcal{C}$$, and a fixed but arbitrary distribution $$\mathcal{D}$$ on $$\mathcal{X}$$, suppose we are given samples $$\{(x_{i},h(x_{i}))\}_{i=1}^{m}$$, where $$h\in\mathcal{C}$$ and $$x_{i}$$ are i.i.d. samples generated according to the distribution $$\mathcal{D}$$. Then, we say that an algorithm $$A$$, which has access to the samples $$\{(x_{i},h(x_{i}))\}_{i=1}^{m},\mathcal{X}, \mathcal{Y}$$ and $$\mathcal{C}$$ but no access to $$\mathcal{D}$$, is a proper PAC learner for the hypothesis $$h$$ under the distribution $$\mathcal{D}$$ if for all $$\epsilon,\delta>0$$, the algorithm $$A$$ (given $$\epsilon,\delta$$ as additional inputs) runs in time $$\text{poly}(n,\frac{1}{\epsilon},\frac{1}{\delta})$$ and requires $$\text{poly}(n,\frac{1}{\epsilon},\frac{1}{\delta})$$ many samples to return a classifier $$\tilde{h}\in\mathcal{C}$$ such that with probability at least $$1-\delta$$ over the examples
$P_{x\sim\mathcal{D}}[\tilde{h}(x)\neq h(x)]\leq\epsilon$
Finally, we say that the algorithm $$A$$ is a proper PAC learner for the concept class $$\mathcal{C}$$ if for every $$h\in\mathcal{C}$$ and for every distribution $$\mathcal{D}$$ on $$\mathcal{X}$$, the above holds. Note that the polynomials in the above definition are not allowed to depend on the distribution $$\mathcal{D}$$ or the “correct” hypothesis $$h$$.
A slight weakening of the above notion is that of an improper PAC learner, under which the $$\tilde{h}$$ returned by the algorithm $$A$$ need not belong to the concept class $$\mathcal{C}$$ (but all of the other properties of a PAC learner must still hold).
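For a finite concept class, empirical risk minimization gives a proper PAC learner in the realizable setting; here is a toy sketch over threshold functions on $$\{0, \ldots, 9\}$$ (entirely our own illustration, not from the lecture):

```python
import random

def erm(samples, concept_class):
    """Return the hypothesis in the class with the fewest errors on the sample."""
    def empirical_error(h):
        return sum(h(x) != y for x, y in samples)
    return min(concept_class, key=empirical_error)

def make_threshold(t):
    """Concept class member: h_t(x) = 1 iff x >= t."""
    return lambda x: int(x >= t)
```

In the realizable case the target hypothesis has zero empirical error, so the minimizer also has zero empirical error on the sample; standard uniform-convergence arguments then bound its error under $$\mathcal{D}$$.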

Before proceeding further, we make a few remarks on the assumptions underlying the PAC learning model. For many applications, the assumption that our samples are i.i.d. samples from a common distribution $$\mathcal{D}$$ can be very unrealistic. For instance, if our sample consists of images generated by taking snapshots of a video at a fixed frequency, then we expect to have some correlation between samples. On the other hand, the requirement that a PAC learner should perform well under any distribution on the domain space can be quite restrictive, and indeed, it is the case that proofs of theorems about non-PAC-learnability sometimes involve carefully constructing a “bad” distribution for a given algorithm. We also emphasize that the definition of PAC learnability that we have presented here seeks efficiency in terms of both time complexity as well as sample complexity. This is in contrast to statistical learning theory, where the focus is typically only on sample complexity.

Another notion, central to the definition of PAC learnability as presented above, is that of “realizability” i.e. the assumption that the data is generated by a function in our hypothesis class $$\mathcal{C}$$. This can obviously fail to hold in many cases of interest, for instance, when we choose a not-too-big hypothesis class $$\mathcal{C}$$ to avoid over-fitting. To deal with such situations we introduce the stronger notion of agnostic PAC learning.

The setup is the same as in proper PAC learning, except that instead of assuming that the data is of the form $$(x_i,h(x_i))$$ for some function $$h\in\mathcal{C}$$, we assume that the pairs $$(x_{i},y_{i})$$ are generated by some distribution $$\mathcal{D}$$ on $$\mathcal{X}\times\mathcal{Y}$$. Note, in particular, that since the distribution is over $$\mathcal{X}\times\mathcal{Y}$$ and not just on $$\mathcal{X}$$, we are also allowing for situations where $$(x,y)$$ and $$(x,y')$$ can both have positive mass for $$y\neq y'$$, so that the data cannot be generated by any function from $$\mathcal{X}\rightarrow\mathcal{Y}$$. We define the “quality of optimal approximation of $$\mathcal{D}$$ by $$\mathcal{C}$$” by
$E(\mathcal{D},\mathcal{C})\colon=\inf_{h\in\mathcal{C}}P_{(x,y)\sim\mathcal{D}}[h(x)\neq y],$
and we say that an algorithm $$A$$ is an agnostic PAC learner for a concept class $$\mathcal{C}$$ if for all $$\epsilon,\delta>0$$, the algorithm consumes $$\text{poly}(n,\frac{1}{\epsilon},\frac{1}{\delta})$$ time and samples and, with probability at least $$1-\delta$$ over the samples, returns $$\tilde{h}\in\mathcal{C}$$ such that
$P_{(x,y)\sim\mathcal{D}}\left[\tilde{h}(x)\neq y\right]\leq E(\mathcal{D},\mathcal{C})+\epsilon$
As before, we can relax agnostic PAC learning to improper agnostic PAC learning. Another relaxation is the notion of $$\alpha$$-agnostic PAC learning for some $$\alpha\geq1$$, where everything is exactly
as above, except we only require that
$P_{(x,y)\sim\mathcal{D}}\left[\tilde{h}(x)\neq y\right]\leq\alpha E(\mathcal{D},\mathcal{C})+\epsilon$

The general theory of PAC learning goes through with arbitrary domains $$\mathcal{X}$$ and arbitrary codomains $$\mathcal{Y}$$. For example, one could talk about the PAC learnability of a concept class $$\mathcal{C}$$ of functions $$\mathbb{R}^{n}\rightarrow\mathbb{R}$$. One of the things that changes when going from the discrete to the non-discrete setting is that the “loss” $$P_{(x,y)\sim\mathcal{D}}[h(x)\neq y]$$ for $$h\in\mathcal{C}$$ becomes too stringent, and often not useful. Instead, it is much more common to replace it with $$\mathbb{E}_{(x,y)\sim\mathcal{D}}[L(h(x),y)]$$, where $$L\colon\mathcal{Y}\times\mathcal{Y}\rightarrow\mathbb{R}$$ is a choice of “loss function”. For instance, when $$\mathcal{Y}=\mathbb{R}^{k}$$, common choices of $$L$$ include the
$$L^{1}$$ loss, $$L(y_{1},y_{2})=||y_{1}-y_{2}||_{L^{1}}$$ and the $$L^{2}$$ loss, $$L(y_{1},y_{2})=||y_{1}-y_{2}||_{L^{2}}^{2}$$.

We conclude this section with an example-sketch of how basic results in PAC learning are proved. Let $$\mathcal{X}=\mathbb{S}^{n-1}\subset\mathbb{R}^{n}$$ be the
unit sphere in $$\mathbb{R}^{n}$$ and $$\mathcal{Y}=[0,1]$$. Consider the concept class $$\mathcal{C}$$ consisting of functions $$\mathbb{S}^{n-1}\rightarrow[0,1]$$ of the form
$h_{w}(x)=\sum_{i=1}^{n}w_{i}x_{i}\qquad\text{with}\qquad w\in\mathbb{S}^{n-1}\subset\mathbb{R}^{n}.$
Finally, the loss function $$L$$ is the $$L^{2}$$ loss (or in this case, the quadratic loss) given by $$L(y_{1},y_{2})=(y_{1}-y_{2})^{2}$$. To show that $$\mathcal{C}$$ is agnostically PAC learnable, it suffices to exhibit an agnostic PAC learning algorithm $$A$$ for $$\mathcal{C}$$. We claim that the empirical risk minimizer, given by
$\tilde{h}=h_{w^{*}},\qquad h_{w^{*}}(x)=\sum_{i=1}^{n}w_{i}^{*}x_{i},$
for $$w^{*}=\text{argmin}_{w\in\mathbb{S}^{n-1}}\sum_{i=1}^{m}L(h_{w}(x_{i}),y_{i})$$ (where $$m$$ is the number of samples) is such a learner. There are two steps involved in showing this. First, $$w^{*}=\arg\min_{w}\sum_{i=1}^{m}(h_{w}(x_{i})-y_{i})^{2}$$ can be computed in $$\text{poly}(m)$$ time via the well-known least-squares formula. Second, assuming that $$\{(x_{i},y_{i})\}_{i=1}^{m}$$ are i.i.d. samples from some distribution $$\mathcal{D}$$ on $$\mathbb{S}^{n-1}\times[0,1]$$, a standard concentration of measure argument shows that for $$m=\text{poly}(n,\frac{1}{\epsilon},\frac{1}{\delta})$$ samples, the empirical loss $$\frac{1}{m}\sum_{i=1}^{m}(\tilde{h}(x_{i})-y_{i})^{2}$$ will be $$\frac{\epsilon}{4}$$-close to the true loss $$\mathbb{E}_{(x,y)\sim\mathcal{D}}[L(\tilde{h}(x),y)]$$ with probability at least $$1-\delta$$ over the samples; one can then use this, along with the compactness of $$\mathbb{S}^{n-1}\times[0,1]$$, to show that $$\mathbb{E}_{(x,y)\sim\mathcal{D}}[L(\tilde{h}(x),y)]\leq\inf_{h\in\mathcal{C}}\mathbb{E}_{(x,y)\sim\mathcal{D}}[L(h(x),y)]+\epsilon$$.
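To make the empirical risk minimization step concrete, here is a minimal numerical sketch. All data, the noise model, and the projection of the unconstrained least-squares solution back onto the sphere (as a heuristic stand-in for the constrained argmin) are our own choices, not part of the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 2000

# Hypothetical ground truth w_true on the sphere; labels are linear plus small noise.
w_true = rng.normal(size=n)
w_true /= np.linalg.norm(w_true)
X = rng.normal(size=(m, n))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # samples x_i on S^{n-1}
y = X @ w_true + 0.05 * rng.uniform(-1, 1, size=m)

# Empirical risk minimizer: unconstrained least squares, projected onto the sphere.
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
w_hat = w_ls / np.linalg.norm(w_ls)

empirical_loss = np.mean((X @ w_hat - y) ** 2)
loss_at_truth = np.mean((X @ w_true - y) ** 2)
assert empirical_loss <= loss_at_truth + 0.01   # ERM is near-optimal on the sample
```

With enough samples the empirical loss of $$\tilde{h}$$ concentrates around the true loss, which is the content of the concentration step above.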

## PAC Learnable Net Models

In this section, we intend to give a flavor of PAC learning results for feedforward neural networks. This is a very young and active area of research, and most theoretical results are negative. However, in this section, our focus will be on positive results, following Livni, Shalev-Shwartz and Shamir. Using the same notation as in the previous section, we work in the setting where $$\mathcal{X}=\{0,1\}^{d}$$ and $$\mathcal{Y}=\{\pm1\}$$. Our concept classes will be feedforward neural networks depending on a few parameters, and we will denote them by $$\mathcal{C}_{t,n,\sigma,L}$$: $$t$$ denotes the depth of the network, $$n$$ denotes its size, $$\sigma$$ denotes the activation function, and $$L$$ denotes a constant such that for each neuron in the network, we have $$||w||_{1}+|b|\leq L$$, where $$w$$ denotes the vector of weights input to the neuron, and $$b$$ denotes the bias added by the neuron. Note that the function computed by such a network is real valued—we convert it to a $$\{\pm1\}$$-valued function by taking its sign. Moreover, in the learning theorems below, we will make both the “realizability” assumption and the “margin” assumption, i.e. we assume that our learner receives samples of the form $$(x,\text{sign}(h^{*}(x)))$$, where $$h^{*}$$ is some (real-valued) function belonging to the class $$\mathcal{C}$$, and further, $$|h^{*}(x)|\geq1$$ for all $$x\in\mathcal{X}$$. Finally, we will consider improper learners.

[LSS, ’14] Constant-depth polynomial networks (i.e. the activation function is $$\sigma(y)=y^{2}$$) are learnable in $$\text{poly}(d)$$ time under the above assumptions.

Indeed, a polynomial network of depth $$\ell$$ always computes a polynomial function in $$d$$ variables of degree at most $$2^{\ell}$$. We can view such polynomials as living in a linear space of dimension $$k$$, where $$k\leq(d+1)^{2^{\ell}}$$ denotes the number of monomials of degree at most $$2^{\ell}$$ in $$d$$ variables. Since $$\ell$$ is constant, $$2^{\ell}$$ is also constant. Hence, we are reduced to learning a linear function on $$\text{poly}(d)$$ variables, which can be done in $$\text{poly}(d)$$ time (for instance, when a convex loss function is used) using standard learning algorithms.
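The linearization argument can be sketched on a toy squaring network (the network shapes and sample sizes below are arbitrary choices): since the network's output is exactly a degree-2 polynomial, a linear fit over monomial features recovers it exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4

# A tiny depth-1 polynomial network: v . (Wx + b)^2 is a degree-2 polynomial in x.
W = rng.normal(size=(3, d))
b = rng.normal(size=3)
v = rng.normal(size=3)

def net(x):
    return v @ (W @ x + b) ** 2

def features(x):
    # All monomials of degree <= 2, including the constant: 1, x_i, x_i x_j.
    feats = [1.0] + list(x)
    feats += [x[i] * x[j] for i in range(d) for j in range(i, d)]
    return np.array(feats)

X = rng.normal(size=(200, d))
Phi = np.array([features(x) for x in X])
y = np.array([net(x) for x in X])
coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # linear learning in feature space

x_new = rng.normal(size=d)
assert abs(features(x_new) @ coef - net(x_new)) < 1e-6
```

For depth $$\ell$$ the same idea applies with monomials of degree up to $$2^{\ell}$$, which is why the dimension of the feature space stays $$\text{poly}(d)$$ for constant $$\ell$$.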

By an argument involving the approximation of the sigmoid function by Chebyshev polynomials, the previous proposition can be used to show the following:

[LSS, ’14] The concept class $$\mathcal{C}_{2,n,\sigma_{\text{sigmoid}},L}$$ can be learned to accuracy $$\epsilon$$ with probability at least $$0.99$$ over the samples in time $$\text{poly}(T)$$, where $$T=\frac{1}{\epsilon}O(d^{4L\ln(11L^{2}+1)})$$.

We conclude this brief section by mentioning a recent learning result due to Goel, Kanade, Klivans and Thaler for feedforward neural networks with ReLU gates. In this theorem, we take $$\mathcal{X}=\mathbb{S}^{d-1}$$ and $$\mathcal{Y}=[0,1]$$. We mention that the theorem actually holds for a stronger notion called reliable learnability. We also note that the $$2^{O(1/\epsilon)}$$ factor appearing below precludes this from being a statement about (efficient) PAC-learnability, according to the definition in this lecture.

[GKKT, ’16] Let $$\mathcal{C}=\{x\mapsto\max(0,w\cdot x)$$, $$||w||_{2}\leq1\}$$. Then, there is a learning algorithm $$A$$ that reliably learns $$\mathcal{C}$$ in time $$2^{O(1/\epsilon)}\cdot d^{O(1)}$$.

## Hardness of Learning

We now discuss results related to the hardness of learning. In general, hardness results state that it is algorithmically hard to find a solution to a problem, even though it is information theoretically possible to do so. Hardness of learning results are often based on hardness assumptions coming from cryptography or computational complexity. For instance, a typical hardness result might state that if one can learn a class of functions, then one can break a cryptosystem, or one can solve some canonical algorithmic problem (which is believed to be hard). An early result in this direction is the following proposition due to Kearns and Valiant:

[KV, ’94] Consider a public-key cryptosystem for encrypting individual bits by $$n$$-bit strings. Let $$\mathcal{C}$$ be a concept class that contains all the decryption functions $$d_{K}\colon\{0,1\}^{n}\rightarrow\{0,1\}$$ of the cryptosystem, one for each choice of key $$K=(K_{priv},K_{pub})$$. Let $$\epsilon(n)=P_{K,r}[d_{K}(e_{K,r}(0))\neq0$$ or $$d_{K}(e_{K,r}(1))\neq1]$$ be the probability of decryption error (over the choice of keys and the randomization in the encryption). If $$\mathcal{C}$$ is weakly PAC-learnable in time $$t(n)$$ with $$t(n)\epsilon(n)=1/n^{\omega(1)}$$, then there is a distinguisher between the encryptions of $$0$$ and $$1$$ that runs in time $$O(t(n))$$.

In the above result, we are considering public-key cryptography with key $$K=(K_{priv},K_{pub})$$, randomized encryption (using the key $$K$$ and random bits $$r$$), and deterministic decryption (using the key $$K$$). Weak PAC-learnability relaxes the requirement that $$P_{x\sim\mathcal{D}}[\tilde{h}(x)\neq h(x)]\le \epsilon$$ to $$P_{x\sim\mathcal{D}}[\tilde{h}(x)\neq h(x)]<\frac{1}{2}-\frac{1}{n^{c}}$$ for some constant $$c$$, i.e. the learner performs non-negligibly better than random guessing. Finally, an algorithm $$A$$ is said to distinguish between the encryptions of $$0$$ and $$1$$ if for some universal constant $$c$$ $|P_{K,r}[A(K_{pub},e_{K,r}(1))=1]-P_{K,r}[A(K_{pub},e_{K,r}(0))=1]|\geq\frac{1}{n^{c}}$ The intuition behind the Kearns-Valiant theorem is that, since encryptions are easy to generate, one can view a collection of encryptions as training examples for learning the decryption function. But since learning the decryption function to a non-negligible advantage amounts to breaking the cryptosystem, it follows that if the cryptosystem is secure, we cannot learn the decryption function. Our focus in this section will be a result of Klivans and Sherstov which states that PAC learning the concept class $$\mathcal{C}$$ of $$2$$-layer feedforward neural networks with the threshold activation function $$\sigma(x)=\boldsymbol{1}_{x\geq0}$$ is hard under a certain cryptographic assumption. More precisely, we have the following:

[KS, ’06] Assume that depth-2 polynomial-size circuits of threshold gates are PAC learnable. Then there is a polynomial-time solution to $$\tilde{O}(n^{1.5})$$-uSVP.

Here, $$f(n)$$-uSVP refers to the unique shortest vector problem on an $$n$$-dimensional lattice, in which the goal is to find the shortest non-zero vector in a given lattice, provided that it is shorter by a factor of at least $$f(n)$$ than any other non-parallel vector. We take $$f(n)=\tilde{O}(n^{1.5})$$, an approximation factor for which this problem is believed to be hard. The theorem follows from the following result about the hardness of learning intersections of half-spaces.

[KS, ’06] Assume that intersections of $$n^{\lambda}$$ halfspaces in $$n$$ dimensions are PAC learnable for some constant $$\lambda>0$$. Then there is a polynomial-time solution to $$\tilde{O}(n^{1.5})$$-uSVP.

Indeed, this result implies the previous one, since a half-space, i.e. a subset of the form $$a_{1}x_{1}+\dots+a_{n}x_{n}-\theta\geq0$$, is easily implemented by a threshold gate, and an intersection of half-spaces by a depth-2 threshold circuit. Moreover, by an argument due to Regev, the theorem continues to hold even when all the half-spaces involved are “light”, i.e. $$\sum_{i=1}^{n}|a_{i}|+|\theta|$$ is bounded by some polynomial in $$n$$.

To prove the latter theorem, one first uses a simple linearization argument (like the one we used in the previous section) to show that if intersections of $$n^{\lambda}$$ halfspaces in $$n$$ dimensions are weakly PAC-learnable, then for any constant $$c>0$$, intersections of $$n^{c}$$ degree-2 polynomial threshold functions (PTFs), i.e. functions of the form $$\boldsymbol{1}_{p(x)\geq0}$$ where $$p(x)$$ is a polynomial of degree at most $$2$$, are also weakly PAC-learnable. Next, we show that the class of such intersections of degree-2 PTFs contains the decryption function of a cryptosystem whose security is based on the hardness of $$\tilde{O}(n^{1.5})$$-uSVP. By the Kearns-Valiant result, it follows that if this class is weakly PAC-learnable, the cryptosystem is insecure, and hence $$\tilde{O}(n^{1.5})$$-uSVP is solvable in polynomial time.

We now provide a description of this cryptosystem due to Regev. We begin by taking $$N=2^{8n^{2}}$$ and $$m=cn^{2}$$ where $$c$$ is a universal constant. The private key consists of a real number $$H\in[\sqrt{N},2\sqrt{N}]$$, and the public key consists of a vector $$(A_{1},\dots,A_{m},i_{0})$$ where $$i_{0}\in\{1,\dots,m\}$$, and each $$A_{i}\in\{0,\dots,N-1\}$$. Moreover, the $$A_{i}$$’s are “almost” integer multiples of $$N/H$$ and $$A_{i_{0}}$$ is “almost” an odd integer multiple of $$N/H$$. To encrypt $$0$$, we pick a random subset $$S\subset[m]$$ and output $$\sum_{i\in S}A_{i}\mod N$$; to encrypt $$1$$, we pick a random subset $$S\subset[m]$$ and output $$\lfloor\frac{A_{i_{0}}}{2}\rfloor+\sum_{i\in S}A_{i}\mod N$$. The idea is that $$0$$ will be encrypted to something which is close to an integer multiple of $$N/H$$, whereas $$1$$ will be encrypted to something which is close to an odd half-integer multiple of $$N/H$$. Accordingly, on receiving $$W\in\{0,\dots,N-1\}$$, the decryption system outputs $$0$$ if $$\{WH/N\}<\frac{1}{4}$$ and outputs $$1$$ otherwise. Here, $$\{a\}$$ for $$a\in\mathbb{R}$$ denotes the distance from $$a$$ to the nearest integer. Moreover, by a standard argument, we may instead use the decryption rule based on $$\{AW\}$$, where $$A$$ is a representation of $$H/N$$ to within $$\text{poly}(n)$$ fractional bits. The only thing that remains to be shown is that the decryption function of this cryptosystem can be implemented using the intersection of $$n^{\beta}$$ degree-2 PTFs for some constant $$\beta>0$$. We have already remarked that we can use a decryption function of the form $$\{AW\}$$ where $$A$$ has $$k=\text{poly}(n)$$ fractional bits. Moreover, since $$W$$ is integral, we can ignore the integral part of $$A$$. Denoting $$A=0.b_{1}b_{2}\dots b_{k}$$ and $$W=\sum_{j=0}^{\ell-1}x_{j}2^{j}$$ with $$\ell\leq8n^{2}$$, we have
\begin{eqnarray*}
\{AW\} & = & \{\sum_{i=1}^{k}\sum_{j=0}^{\ell-1}b_{i}x_{j}2^{j-i}\}\\
& = & \{\sum_{i=1}^{k}\sum_{j=0}^{\min\{\ell-1,i-1\}}b_{i}x_{j}2^{j-i}\}
\end{eqnarray*}
since we can discard those terms $$b_{i}x_{j}2^{j-i}$$ which are integers. Denoting $$S(x)=\sum_{i=1}^{k}\sum_{j=0}^{\min\{\ell-1,i-1\}}b_{i}x_{j}2^{j-i}$$, we observe that $$S(x)$$ is an integral multiple of $$2^{-k}$$ which lies between $$0$$ and $$k$$. Let $$\{[a_{u},b_{u}]\}$$ denote the collection of intervals in $$[0,k]$$ which are not within $$1/4$$ of an integer. Note that there are $$k$$ such intervals, and that each interval $$[a,b]$$ can be recognized by the PTF $$\left(S(x)-\frac{a+b}{2}\right)^{2}\leq\left(\frac{b-a}{2}\right)^{2}$$. Hence, the complement of such an interval can be recognized by $$\left(S(x)-\frac{a+b}{2}\right)^{2}>\left(\frac{b-a}{2}\right)^{2}$$, and by taking the intersection of these $$k$$ degree-2 PTFs, we obtain the region of $$[0,k]$$ which is within $$1/4$$ of an integer, which is precisely the region where the decryption system outputs $$0$$.
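The computation above can be checked on a toy instance. The parameters below are tiny and purely illustrative (nothing like the $$N=2^{8n^{2}}$$ of the actual cryptosystem), and the random bit patterns are hypothetical:

```python
import numpy as np

# Toy instance: A has k fractional bits, W is an l-bit integer with bits x_0..x_{l-1}.
k, l = 6, 5
rng = np.random.default_rng(2)
bits_A = rng.integers(0, 2, size=k)          # A = 0.b1 b2 ... bk

def frac_dist(a):
    """{a}: distance from a to the nearest integer."""
    return min(a % 1.0, 1.0 - (a % 1.0))

def S(x):
    # S(x) = sum over i, j <= min(l-1, i-1) of b_i x_j 2^{j-i}: the non-integer part of A*W.
    return sum(bits_A[i - 1] * x[j] * 2.0 ** (j - i)
               for i in range(1, k + 1)
               for j in range(0, min(l - 1, i - 1) + 1))

# Intervals of [0, k] at distance > 1/4 from every integer: (t + 1/4, t + 3/4).
intervals = [(t + 0.25, t + 0.75) for t in range(k)]

def decrypt_by_ptfs(x):
    s = S(x)
    # Intersection of the k degree-2 PTFs (S - (a+b)/2)^2 > ((b-a)/2)^2.
    return 0 if all((s - (a + b) / 2) ** 2 > ((b - a) / 2) ** 2
                    for a, b in intervals) else 1

for _ in range(50):
    x = list(rng.integers(0, 2, size=l))
    W = sum(int(x[j]) << j for j in range(l))
    A = sum(int(b) * 2.0 ** -(i + 1) for i, b in enumerate(bits_A))
    assert decrypt_by_ptfs(x) == (0 if frac_dist(A * W) < 0.25 else 1)
```

All quantities here are dyadic rationals with few bits, so the floating-point comparisons are exact.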

To conclude, we mention some other hardness of learning results in the literature. Assuming that there is no efficient algorithm that never refutes a satisfiable random $$3$$-SAT formula with $$f(n)=n\log^{2}n$$ clauses, but refutes most such random formulae, Daniely and Shalev-Shwartz proved that learning the intersection of $$\log^{3}n$$ half-spaces is hard. Daniely also proved that agnostically learning a single half-space is hard, assuming that a related computational problem is hard on average. In line with our earlier comments about PAC learning, we note that these results all show that there are some distributions and some functions which are hard to learn. It would be interesting to obtain hardness results for an a priori fixed, natural choice of distribution.


## Mathematics of Deep Learning: Lecture 3 – More on depth separation.

Transcribed by Paxton Turner (edited by Asad Lodhia, Elchanan Mossel and Matthew Brennan)

In the Boolean circuits setup, there is a separation result based on depth of the following form:

(Rossman, Servedio, Tan): There exists a function $$f: \{0,1\}^n \to \{0,1\}$$ that can be computed by a linear size depth $$d$$ formula using AND, OR, and NOT gates, such that any depth $$d - 1$$ circuit that agrees with $$f$$ on a $$1/2 + o(1)$$ fraction of inputs is of size $$\exp(n^{\Omega(1/d)})$$.

Question: Can we prove a similar result for deep nets?

Last time we discussed the following separation result.

(Eldan, Shamir): There exists a probability measure $$\mu$$ on $$\mathbb{R}^d$$ and a function $$g: \mathbb{R}^d \to [-2,2]$$ such that:

• $$\mu$$ is supported on $$\{x: ||x||^2 \leq Cd \}$$ and $$g$$ is expressible by a 3-layer network of width $$\text{poly}(d)$$;
• for all $$f$$ expressible by a 2-layer network of width $$\exp(\epsilon d)$$ it holds that
$\mathbb{E}_\mu[ |f - g|^2 ] \geq \epsilon.$

Today we will discuss a more elegant treatment of the 2 vs. 3 separation result which yields slightly stronger results.

# Daniely’s Model

Daniely’s model is defined as follows. Let $$x, x'$$ be chosen uniformly (i.e., with respect to the Haar measure) from $$S^{d-1}$$ and let $$f(x, x') = g(\langle x, x' \rangle)$$. The nets of depth $$\ell$$ are of the form:
\begin{eqnarray*}
\ell = 2 \qquad &w_2^{\text{T}} \sigma( W_1 x + W_1' x' + b_1) + b_2 \\
\ell = 3 \qquad &w_3^{\text{T}} \sigma (W_2 \sigma(W_1 x + W_1' x' + b_1) + b_2) + b_3 \\
\end{eqnarray*}
on the input $$(x, x’)$$. Here, we prove the following theorem.

If $$g:[-1,1] \to [-1,1]$$ is $$L$$-Lipschitz, then there exists a 3-layer representation $$F$$ of $$f$$ with width $$O(d^2 L / \epsilon)$$ and weights bounded by $$O(1 + L)$$ such that $$||f - F||_{\infty} \leq \epsilon$$.

This illustrates the “magic of 3.” Intuitively, we see the power of quadratic models over linear ones, at least for this specific context.

We begin with the following lemma which is of the flavor of universality of depth 1 nets:

If $$f: [-R, R] \to \mathbb{R}$$ is $$L$$-Lipschitz, then for all $$\epsilon > 0$$, $$f$$ can be approximated as follows, with $$\beta_i \leq R$$, $$\alpha_i \leq 2L$$, and $$\gamma_i \in \{-1, 1\}$$:
$\left\|f(x) - f(0) - \sum_{i=1}^m \alpha_i \sigma(\gamma_i x - \beta_i) \right\|_{\infty} \leq \epsilon, \, m \leq \frac{2RL}{\epsilon}$

Proof of Lemma (Sketch): Repartition the interval, and note that the Lipschitz condition implies the graph doesn’t deviate too far from a straight line.
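A numerical sketch of the lemma, with the piecewise-linear interpolant written as a sum of ReLUs (the knot placement and the test function $$\sin$$ are our own choices; the coefficients are slope changes, bounded by $$2L$$):

```python
import numpy as np

R, L, eps = 1.0, 1.0, 0.05
f = np.sin                                # an L-Lipschitz test function on [-R, R]

m = int(np.ceil(2 * R * L / eps))         # about 2RL/eps ReLU units
knots = np.linspace(-R, R, m + 1)
h = knots[1] - knots[0]

slopes = np.diff(f(knots)) / h            # slope on each cell of the grid
coeffs = np.diff(slopes, prepend=0.0)     # slope changes = ReLU coefficients

def relu_net(x):
    # f(-R) + sum_i coeffs[i] * relu(x - knots[i]) is the piecewise-linear interpolant.
    return f(knots[0]) + np.sum(coeffs * np.maximum(0.0, x[:, None] - knots[:-1]), axis=1)

xs = np.linspace(-R, R, 4001)
assert np.max(np.abs(relu_net(xs) - f(xs))) <= eps
```

Between consecutive knots both $$f$$ and the interpolant are $$L$$-Lipschitz and agree at the endpoints, which is exactly the repartitioning argument in the sketch.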

Proof of the Theorem: We will prove the theorem in the following section and then show a corresponding lower bound.

## Upper bound

We first show that the function can be approximated well by a depth 3 net:
$\langle x, x' \rangle = 2\left\| \frac{x+x'}{2} \right\|_2^2 - 1 = 2\sum_{i=1}^d \left( \frac{x_i + x_i'}{2} \right)^2 - 1,$

which implies that a 2-layer network approximates $$\langle x, x’ \rangle$$ well. We can also compute $$\sigma( \langle x, x’ \rangle)$$ as a linear combination of the previous layers. This in combination with the lemma implies the desired result.
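The identity above (for unit vectors, $$\langle x, x' \rangle = 2||\tfrac{x+x'}{2}||_2^2 - 1$$) is easy to verify numerically:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 8
x = rng.normal(size=d);  x /= np.linalg.norm(x)
xp = rng.normal(size=d); xp /= np.linalg.norm(xp)

# For unit vectors, <x, x'> = 2 * || (x + x')/2 ||^2 - 1, so one layer of
# squaring (or ReLU-approximated squaring) units computes the inner product.
lhs = x @ xp
rhs = 2 * np.sum(((x + xp) / 2) ** 2) - 1
assert abs(lhs - rhs) < 1e-12
```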

## Lower bound

We want to show that a depth two net that approximates the function well is
very wide. For the analysis it will be useful to answer the following question:

Question: What is the distribution of $$\langle x, x’ \rangle$$?

Answer: Rotation invariance implies that

$D(\langle x, x' \rangle) = D( \langle x, r \rangle ) = D( \langle x, e_1 \rangle) = D(x_1)$
for any fixed unit vector $$r$$. Note that this is also true if $$x'$$ is deterministic and $$x$$ is random.

The individual components $$\langle x, e_k \rangle$$ of a uniform random vector $$x \in S^{d-1}$$ approach a Gaussian as $$d \to \infty$$. Indeed,
\begin{eqnarray*}
d \mu_d(x) &= \text{Vol}\left( (1 - x^2)^{1/2} S^{d-2} \right) \\
&= \frac{\Gamma(d/2)}{\sqrt{\pi} \Gamma((d-1)/2)} \left(1 - x^2 \right)^{\frac{d-3}{2}}
\end{eqnarray*}
which is a t-statistic distribution (and as $$d \to \infty$$ converges to a Gaussian). The same reasoning applies if one of the vectors $$x, x'$$ is fixed. Therefore, if $$g = \psi ( \langle v, x \rangle, \langle v', x' \rangle )$$, then
$||g||_2^2 = ||\psi||_{L^2(\mu_d \times \mu_d)}^2,$
and if $$f = \phi( \langle x, x' \rangle )$$, then
$||f||_2^2 = ||\phi||_{L^2(\mu_d)}^2 = \int_{S^{d-1} \times S^{d-1}} f^2(x, x') \, d(m \times m),$
where $$d(m \times m)$$ indicates the Haar measure.

Orthogonal Polynomials: Define $$P_0 = 1, P_1 = x,$$ and
$P_n(x) = \frac{2n + d - 4}{n + d - 3} \cdot x P_{n-1}(x) - \frac{n - 1}{n + d - 3} \cdot P_{n-2}(x).$
We observe that
$||P_n||_2^2 = 1/N_{n,d} \quad \text{where} \quad N_{n,d} = \frac{(2n + d - 2) (n + d - 3)!}{n! (d - 2)!}$
and that $$||P_n||_{\infty} = 1$$. Now define
$h_n(x, x') = \sqrt{N_{n,d}} \, P_n(\langle x, x' \rangle) \quad \text{and} \quad L_n^v(x) = h_n(v, x),$
so that $$||h_n||_2 = 1$$.
The key fact we use is that
$\langle L_i^v, L_j^{v'} \rangle = \mathbb{E}[ L_i^v(x) L_j^{v'}(x)] = \mathbb{E}[h_i(v, x) h_j(v', x) ] = \delta_{i,j} P_i(\langle v, v' \rangle),$
which implies
$\mathbb{E}[h_n(x, x') L_i^v(x) L_j^{v'}(x')] = \mathbf{1}(i = j = n)\, N_{n,d}^{-1/2}\, P_n(\langle v, v' \rangle).$
To flesh out the above more carefully,
\begin{align*}
\mathbb{E}[h_n(x,x') L_i^v(x) L_j^{v'}(x')] &= \mathbb{E}_x\left[L_i^v(x) \mathbb{E}_{x'}[h_n(x,x') L_j^{v'}(x')]\right] \\
&= \mathbb{E}_x\left[L_i^v(x)\, \delta_{j,n} P_n(\langle v', x \rangle)\right] \\
&= \delta_{j,n}\, \mathbb{E}_x\left[L_i^v(x) \frac{L_n^{v'}(x)}{\sqrt{N_{n,d}}}\right] = \delta_{i,n} \delta_{j,n} \frac{P_n(\langle v, v' \rangle)}{\sqrt{N_{n,d}}}.
\end{align*}
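As a sanity check on the recurrence and on $$N_{n,d}$$: for $$d=3$$ the recurrence (with the factor $$x$$ multiplying $$P_{n-1}$$, as in the classical $$d$$-dimensional Legendre recurrence) reduces to the ordinary Legendre recurrence, and $$N_{n,3}=2n+1$$:

```python
import numpy as np
from numpy.polynomial import legendre
from math import factorial

def P(n, d, x):
    # P_0 = 1, P_1 = x, and the three-term recurrence (with the x factor).
    p_prev, p = np.ones_like(x), x
    if n == 0:
        return p_prev
    for m in range(2, n + 1):
        p_prev, p = p, ((2 * m + d - 4) * x * p - (m - 1) * p_prev) / (m + d - 3)
    return p

# For d = 3 these are the classical Legendre polynomials.
x = np.linspace(-1, 1, 101)
for n in range(6):
    ref = legendre.legval(x, [0] * n + [1])   # Legendre P_n evaluated at x
    assert np.max(np.abs(P(n, 3, x) - ref)) < 1e-10

# N_{n,d} = (2n+d-2)(n+d-3)!/(n!(d-2)!); for d = 3 this is 2n+1,
# matching ||P_n||_2^2 = 1/(2n+1) under the uniform measure on [-1,1].
N = lambda n, d: (2 * n + d - 2) * factorial(n + d - 3) // (factorial(n) * factorial(d - 2))
assert [N(n, 3) for n in range(4)] == [1, 3, 5, 7]
```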

Lower Bound Idea: We now outline how to obtain the desired lower bound. Expand $$g(x, x') = \sum \alpha_i h_i(x,x')$$ and choose $$f$$ that is “noise sensitive”, i.e., has a lot of mass at higher Fourier levels. For the rest of the argument assume that $$f = h_n$$ for a large $$n$$ (though $$h_n$$ is not a Lipschitz function, we can find a Lipschitz function that has a lot of weight on high levels).

We want to show that $$||f - \sum_{j=1}^m g_j ||_2^2$$ is large, where each $$g_j$$ is computed by a single hidden unit, unless the width $$m$$ or one of the weights is large. Note that if $$f = h_n$$ is well approximated by $$\sum_{j = 1}^m g_j$$, then there must be at least one $$j$$ with $$\langle g_j, f \rangle \geq \frac{||f||_2^2}{4m}.$$

We now compute $$\langle g_j , f \rangle$$. Since $$g_j(x,x') = \psi(\langle v, x \rangle, \langle v', x' \rangle)$$ for some $$\psi$$, we can expand
$g_j = \sum_{k, \ell} \beta_{k, \ell} L_k^v(x) L_{\ell}^{v'}(x') \quad \text{with} \quad \sum_{k, \ell} \beta_{k, \ell}^2 = ||g_j||_2^2.$
Therefore,
$|\langle g_j, h_n \rangle | = N_{n,d}^{-1/2} |\beta_{n,n} P_n( \langle v, v' \rangle )| \leq N_{n,d}^{-1/2} |\beta_{n,n}|$
so $$m$$ is very large or $$||g_j||_2^2$$ is very large. Informally, we get a bad approximation of $$h_n(x, x')$$ unless

• Width $$\geq N_{n,d}^{\Omega(1)}$$ or
• Max-weight $$\geq N_{n,d}^{\Omega(1)}$$

Open Questions: 1. Can it really be that large weights solve the approximation problem? 2. Instead of working with the Haar measure, can this computation be performed in Gaussian space directly? We lose rotation invariance, but it is a bit strange that we work with measures that are in some sense close to the Gaussian measure and cannot prove the result directly in Gaussian space.

# Kane-Williams Model

Threshold gates are given by

$\mathbf{1}\left\{\sum w_i x_i \geq t\right\}.$

There exists a function $$A_n: \{0,1\}^n \to \{0,1\}$$ that can be computed

• using a 3-layer threshold circuit with $$O(n)$$ gates,
• but requires at least $$\Omega(n^{3/2})$$ gates to be computed correctly on a $$0.5 + \epsilon$$ fraction of inputs by a 2-layer circuit.

The basic unit threshold is
$f(x) = \text{sgn}(w_0 + \sum_{i = 1}^n w_i x_i).$

We assume a distribution that is uniform over $$\{0,1\}^n$$.

Meta-Question: Can we use reductions between models (e.g., the previous result) to get better separation in the binary case?

Partial answer: Due to complexity-theoretic issues, better than $$\text{poly}(n)$$ isn’t possible.

Question: Can we replace new activations into the previous result (or this one)?

Answer: Isn’t written down, but should be straightforward.

First define the function $$A_n$$ for $$n = 2^{k+1}$$. The generalized Mux function is
$M(x_1, \ldots, x_{2^k}, a_1, \ldots, a_k) = x_{(a_1 a_2 \cdots a_k)_2},$
the data bit indexed by the binary string $$a_1 a_2 \cdots a_k$$. Next,
$A_n(x_1, \ldots, x_{2^k}, a_1, \ldots, a_{2^k}) = M(x_1, \ldots, x_{2^k}, \oplus_{i=1}^{2^k/k} a_i, \oplus_{i=2^k/k + 1}^{2 \cdot 2^k/k} a_i, \ldots),$
where $$\oplus_{i=1}^{2^k/k} a_i$$ indicates the parity of the corresponding slice of the string.
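A direct implementation for $$k = 2$$ (so that $$2^k/k$$ is an integer; the exact slicing convention is our guess from the formula):

```python
from functools import reduce

k = 2
n_x = 2 ** k   # 4 data bits, plus 2^k address-source bits a

def M(x, addr):
    """Generalized mux: the data bit indexed by the binary string addr."""
    idx = int("".join(map(str, addr)), 2)
    return x[idx]

def A(x, a):
    """A_n: the k address bits are parities of consecutive slices of a."""
    block = len(a) // k
    addr = [reduce(lambda u, v: u ^ v, a[i * block:(i + 1) * block]) for i in range(k)]
    return M(x, addr)

x = [0, 1, 0, 1]
a = [1, 0, 0, 1]
# slice parities: [1^0, 0^1] = [1, 1]  ->  index 3  ->  x[3] = 1
assert A(x, a) == 1
```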

Claim 1: $$M$$ can be computed in depth 2 using $$O(n \log n)$$ gates.

Claim 2: Parity can be computed using threshold gates.

Proof of Claim 2: Let $$b_j = \mathbf{1}\left\{\sum_i x_i \geq j\right\}$$ for $$j = 1, \ldots, n$$. It can be shown that there exist weights $$w_1, \ldots, w_n$$ such that

$\text{sgn}(\sum w_i b_i) = \oplus x_i.$

In fact, this can be done with weights $$1 \leq w_i \leq \text{poly}(n)$$, though we don’t show this here.
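One valid choice of weights (alternating $$\pm 2$$ with a $$-1$$ bias, our own pick rather than the bounded-weight construction alluded to above) can be verified by brute force:

```python
from itertools import product

def parity_via_thresholds(x):
    # First layer: threshold gates b_j = 1{sum x_i >= j}.
    n = len(x)
    s = sum(x)
    b = [1 if s >= j else 0 for j in range(1, n + 1)]
    # Output gate: sign of an alternating weighted sum +2, -2, +2, ... with bias -1.
    w = [2 * (-1) ** (j + 1) for j in range(1, n + 1)]
    total = sum(wi * bi for wi, bi in zip(w, b)) - 1
    return 1 if total > 0 else 0

for n in (1, 2, 3, 4, 5):
    for x in product([0, 1], repeat=n):
        assert parity_via_thresholds(list(x)) == sum(x) % 2
```

The point is that the partial sums of the alternating weights equal $$+1$$ after an odd number of active gates and $$-1$$ after an even number, so the sign reads off the parity.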

Lower bound ingredients: We now move onto the lower bound.

Lemma: For all but an $$\epsilon$$ fraction of $$n$$-bit Boolean functions $$f$$, no depth-2 network of size $$o(\epsilon^2 2^n/n)$$ agrees with $$f$$ on more than a $$0.5 + \epsilon$$ fraction of the inputs.

Proof sketch of lemma: The number of threshold functions on $$n$$ bits is at most $$2^{O(n^2)}$$, while the total number of Boolean functions is $$2^{2^n}$$; that is, there aren’t too many threshold functions in general. Consequently, for most Boolean functions $$f$$ on $$\{0,1\}^n$$, no 2-layer network of size $$O(\frac{\epsilon 2^n}{n^2})$$ approximates $$f$$ well.

Returning to the proof of the theorem, the idea is to use a random restriction to generate a random function, which lets us apply the lemma: most functions cannot be approximated by small 2-layer networks. The construction is:

• Pick $$x$$ randomly.
• In each block of $$a_i$$, fix all coordinates but one randomly.
• This generates a random function on $$k$$ variables.

Question: What does a 2-layer network look like after randomly fixing almost all bits?

Intuitive answer: Once enough bits are fixed, most gates become constant and die. Then we’re left with a smaller network which can’t approximate $$f$$ well.

Main Lemma: A random restriction of a threshold gate will result in a constant function with probability at least
$1 - O(\log n / n^{1/2}).$
Thus, the number of gates that remain is $$O(n/\log^2(n))$$, which is not enough to represent a random function.

# PAC Learning

We close this lecture with the definition of PAC-learning. We have a space $$X_n$$ and are interested in functions $$f: X_n \to \{0,1\}$$. Let $$\mathcal{C}$$ be the class of such functions (e.g., 2-layer networks).

The class $$\mathcal{C}$$ is PAC-learnable if for all $$\epsilon, \delta$$ and for all distributions $$D$$ on $$X_n$$, there exists an algorithm $$A$$ such that given $$\text{poly}(n, 1/\epsilon, 1/\delta)$$ samples
$(x_i, y_i); \, x_i \sim D ; \, y_i = f(x_i); \, f \in \mathcal{C},$
with probability larger than $$1 – \delta$$, $$A$$ returns a function $$h: X_n \to \{0,1\}$$ such that
$\mathbb{P}_{x \sim D}[h(x) \neq f(x)] \leq \epsilon$
and $$A$$ runs in $$\text{poly}(n, 1/\epsilon, 1/\delta)$$.

Here, $$\delta$$ is interpreted as a global parameter, and $$\epsilon$$ is the probabilistic error.


## Mathematics of Deep Learning: Lecture 2 – Depth Separation.

Transcribed by Julien Edward Clancy (edited by Asad Lodhia, Elchanan Mossel and Matthew Brennan)

Last lecture we saw the most basic theorem about neural network approximation: that for most activation functions, including any $$\sigma \colon \mathbb{R} \to \mathbb{R}$$ such that $$\sigma(x) \to 0$$ as $$x \to -\infty$$ and $$\sigma(x) \to 1$$ as $$x \to \infty$$, and also including the ReLU function, we can represent almost any function given enough neurons, with small error in any reasonable norm. One of the proofs relied on Fourier analysis. The simpler one was to observe that by scaling and translating to $$\sigma(\lambda x + b)$$ for large $$\lambda$$ and some $$b$$ we can approximate the function
$f(x) = \chi_{[a, \infty)}(x)$
for every $$a$$. By subtracting one of these from another we can approximate the indicator of any interval, and by standard arguments we can: 1) approximate any continuous function in the uniform sense, and 2) approximate any $$L^2$$ function in the $$L^2$$ sense. The last observation can also be adapted to Sobolev spaces, or even more detailed function classes.
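A quick sketch of the scaling-and-translation argument; the sharpness $$\lambda$$ and the interval $$[a, b)$$ below are arbitrary choices:

```python
import numpy as np

# Clip the argument to avoid overflow in exp; this does not affect accuracy here.
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-np.clip(t, -60, 60)))

lam, a, b = 200.0, 0.3, 0.7
# sigmoid(lam * (x - a)) ~ chi_{[a, inf)}; a difference of two gives ~ chi_{[a, b)}.
box = lambda x: sigmoid(lam * (x - a)) - sigmoid(lam * (x - b))

xs = np.linspace(0, 1, 1001)
true_box = ((xs >= a) & (xs < b)).astype(float)
mask = (np.abs(xs - a) > 0.05) & (np.abs(xs - b) > 0.05)  # away from the jumps
assert np.max(np.abs(box(xs)[mask] - true_box[mask])) < 1e-3
```

Away from the jump points the error decays like $$e^{-\lambda \cdot \text{dist}}$$, which is the uniform-approximation step in the standard argument.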

An argument can be made that the result above has little to do with the practice of neural networks. Deep networks have gained currency because of their expressive power relative to their size and number of parameters — though the networks used in practice have millions of nodes, they dramatically outpace shallow networks of the same size. The first order of business of this lecture is to take a step towards explaining this.

# Telgarsky’s Depth Separation: 0-1 Loss

For the result we’re about to state (from a paper of Telgarsky that can be found here) let’s consider ReLU networks. Let $$m(x)$$ be the piecewise linear function that is $$2x$$ if $$x \in [0, 0.5]$$, $$2-2x$$ if $$x \in [0.5, 1]$$, and $$0$$ otherwise: a single triangular “tooth” on $$[0,1]$$. This can be computed exactly by a two- or three-layer network, depending on your notational preferences, by the expression
$\sigma( 2 \sigma(x) - 4 \sigma(x - 0.5) )$
which corresponds to a small diamond-shaped network. Since we’re looking at deep networks, what do iterates of this function look like? Each iteration replaces each triangle with two triangles that are half as wide (but of the same height), so $$m^{(n)}(x)$$ has $$2^n$$ “teeth”. Moreover, by chaining these networks, we can obviously represent this function with $$3n+1$$ nodes in a deep network. However, it should take very many nodes to do this in a shallow network. Intuitively, depth increases the number of oscillations multiplicatively (which is why we have an exponential number of teeth) whereas width can only do so additively. The precise (and fairly strong) version of this intuition states:

Given a function $$f$$ and $$N$$ data points $$(x_i, y_i)$$, define its classification error by
$R(f) = \frac{1}{N} \sum_i \chi\{\tilde{f}(x_i) \neq y_i\}$
where $$\tilde{f}(x) = \mathbf{1}\{ f(x) > 1/2 \}$$ is the “sign-rounding” of $$f$$. Let $$x_i = i/2^n$$ and $$y_i = m^{(n)}(x_i)$$, so that $$y = (0,1,0,1,\ldots)$$. Notice that $$R(m^{(n)}) = 0$$. If a network $$g$$ has $$\ell$$ layers and width $$w < 2^{(n-k)/\ell-1}$$, then $$R(g) > \frac{1}{2} - \frac{1}{3 \cdot 2^{k-1}}$$.
The right way to quantify the oscillatory quality of $$m^{(n)}$$, for the purposes of this proof, is by how many affine pieces it has. We say a function $$f$$ is $$t$$-sawtooth if it is piecewise affine with $$t$$ pieces. Clearly $$m$$ is $$4$$-sawtooth, and $$\sigma$$ is $$2$$-sawtooth. Let’s first show a quick lemma:

If $$f$$ is $$a$$-sawtooth and $$g$$ is $$b$$-sawtooth, then $$f + g$$ is at most $$(a + b)$$-sawtooth and $$f \circ g$$ is at most $$ab$$-sawtooth.
[of the Lemma] Addition is simple: merging the “joints” of $$f$$ with those of $$g$$ partitions the line into at most $$a + b$$ intervals, and on each such interval both $$f$$ and $$g$$ are affine, so $$f + g$$ is affine there. For composition, consider an interval on which $$g$$ is affine, say $$g(x) = cx + d$$ there; on this interval $$f \circ g$$ is $$x \mapsto f(cx + d)$$, which is at most $$a$$-sawtooth. Since there are at most $$b$$ such intervals by assumption, $$f \circ g$$ is at most $$ab$$-sawtooth.
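The lemma can be sanity-checked numerically by counting slope changes of $$m$$ and $$m \circ m$$ on a grid (a sketch; the grid is chosen dyadic so the breakpoints land exactly on grid points and the counts are exact):

```python
def relu(z):
    return max(0.0, z)

def m(x):
    # the 4-sawtooth tent function from above
    return relu(2 * relu(x) - 4 * relu(x - 0.5))

def count_pieces(f, lo=-0.5, hi=1.5, n=4097):
    # count maximal intervals of affineness by counting slope changes;
    # n - 1 = 4096 makes every grid point dyadic, so the (dyadic)
    # breakpoints of m and its iterates land exactly on grid points
    xs = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    ys = [f(x) for x in xs]
    slopes = [ys[i + 1] - ys[i] for i in range(n - 1)]
    return 1 + sum(1 for s, t in zip(slopes, slopes[1:]) if abs(s - t) > 1e-12)
```

Here `count_pieces(m)` returns 4, and `m ∘ m` has 6 pieces, within the bound $$4 \cdot 4 = 16$$ from the lemma.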

From this result it is clear that a depth-$$\ell$$, width-$$w$$ ReLU network must produce an output that is at most $$(2w)^\ell$$-sawtooth. If $$w < 2^{(n-k)/\ell - 1}$$ then the network computes a function that is at most $$2^{n-k}$$-sawtooth, whereas $$m^{(n)}$$ is $$2^n$$-sawtooth. Now, let's examine each interval on which $$g$$ is affine. If the interval contains $$p$$ data points, then since $$g$$ is affine on this interval it can have the correct sign for at most $$\frac{p+3}{2}$$ of them (the labels alternate, and an affine function crosses the level $$1/2$$ at most once on the interval). There are $$2^n$$ data points overall, and if we partition them into the $$2^{n-k}$$ intervals on which $$g$$ is affine, then $$g$$ can have the correct sign for at most $$2^{n-1} + 3 \cdot 2^{n-k+1}$$ of them. This means that $$R(g) > \frac{1}{2} - \frac{1}{3 \cdot 2^{k-1}}$$

It is an instructive exercise for the reader to reproduce the example in the three-class, three-output setting (where the assigned class is taken to be the index of the maximum of the three output values).

# Depth Separation: Squared Loss

More generally, we are interested in approximating a function $$f \colon [0,1]^d \to \mathbb{R}$$ by a ReLU network $$\tilde{f}$$ of depth $$\ell$$ and width $$w$$, where the output layer is linear and we use the squared loss to evaluate network performance. The general theme is to find structural properties of $$f$$ that make it hard to approximate by shallow networks; the theorem above was the first such result we’ll see, for a certain notion of oscillatory behavior of functions in one dimension. Since deep nets are mostly used for high-dimensional problems, it is interesting to try to classify how well networks approximate such functions.

Recalling that shallow nets cannot have too many saw-teeth, we first obtain a more general result for quadratic functions:

Let $$p(x) = p_2 x^2 + p_1 x + p_0$$. If $$g$$ is $$n$$-sawtooth, then there is a universal constant $$C > 0$$ such that
$\lVert p - g \rVert_2^2 \geq C p_2^2 / n^4$
where the $$L^2$$ norm is taken over $$[-1, 1]$$.
The decay looks bad, but it is in fact essentially optimal. This should not discourage the reader, however, because what is important for our purposes is the relative rate of deep vs. shallow networks: this result says that a shallow network with $$n$$ units achieves error at best on the order of $$n^{-4}$$, while a deep network with $$n$$ layers can achieve error on the order of $$2^{-4n}$$.
Since $$g$$ is piecewise affine let’s look at the $$L^2$$ error on each of its pieces separately. On each interval $$[b, b+2a]$$ the error looks like
$\int_b^{b+2a} \left| p_2 x^2 + p_1 x + p_0 - cx - d \right|^2 \, dx$
where $$g|_{[b,b+2a]} = cx+d$$. The trick is to alter the function $$g$$ on each piece individually and look at the minimum error over all affine functions; adding these over all intervals will give the lower bound we want. The minimum of the above expression is
$\min_{c,d} \int_b^{b+2a} \left| p_2 x^2 - cx - d \right|^2 \, dx = \min_{c,d} \int_{-a}^{a} \left| p_2 x^2 - cx - d \right|^2 \, dx$
since translating does not change the top coefficient of a quadratic polynomial (and everything else has been absorbed into the minimum). Scaling the integration variable by $$a$$ we get
$\min_{c,d} a\int_{-1}^{1} \left| p_2 a^2x^2 - cax - d \right|^2 \, dx$
and taking out the top coefficient (again absorbing everything else) yields
$\min_{c,d} a^5 p_2^2 \int_{-1}^{1} \left| x^2 - cx - d \right|^2 \, dx = C a^5 p_2^2$
Now let $$\{[a_i,b_i]\}$$ be the intervals on which $$g$$ is affine, and let their lengths be denoted $$\{\ell_i\}$$. Then
$\lVert p - g \rVert_2^2 \geq \sum_i C p_2^2 \ell_i^5 \geq C \frac{p_2^2}{n^4}$
since $$1 \leq \sum \ell_i \leq \left( \sum \ell_i^5 \right)^{1/5} \left( \sum 1 \right)^{4/5}$$ by Hölder’s inequality, so $$\sum \ell_i^5 \geq 1/n^4$$.
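The $$n^{-4}$$ rate can be verified numerically. The sketch below computes, on each interval, the exact $$L^2$$ projection of $$x^2$$ onto the affine functions via the normal equations, and sums the residuals over $$n$$ equal pieces of $$[-1,1]$$; the product $$n^4 \cdot \mathrm{err}(n)$$ then stabilizes at a constant (the helper names are ours):

```python
def best_affine_err(u, v):
    # exact L^2([u, v]) squared error of the best affine fit c*x + d to x^2,
    # via the normal equations for projection onto span{1, x}
    g00 = v - u
    g01 = (v ** 2 - u ** 2) / 2
    g11 = (v ** 3 - u ** 3) / 3
    b0 = (v ** 3 - u ** 3) / 3   # <x^2, 1>
    b1 = (v ** 4 - u ** 4) / 4   # <x^2, x>
    det = g00 * g11 - g01 * g01
    c0 = (g11 * b0 - g01 * b1) / det
    c1 = (g00 * b1 - g01 * b0) / det
    # ||x^2||^2 minus the squared length of the projection
    return (v ** 5 - u ** 5) / 5 - (c0 * b0 + c1 * b1)

def total_err(n):
    # best n-piece piecewise-affine L^2 approximation error of x^2 on [-1, 1]
    pts = [-1 + 2 * i / n for i in range(n + 1)]
    return sum(best_affine_err(pts[i], pts[i + 1]) for i in range(n))
```

For equal pieces of length $$2/n$$ the per-interval error works out to $$\frac{8}{45}(1/n)^5$$, so $$n^4 \cdot \mathrm{err}(n) \to 8/45 \approx 0.178$$, matching the $$C p_2^2 / n^4$$ bound with $$p_2 = 1$$.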

One nice interpretation of the above result is that the coefficient of the $$x^2$$ term is proportional to the curvature $$p'' = 2 p_2$$ of the function, so we’ve bounded the approximation capabilities of a neural network in terms of the curvature of the function being approximated.

## Generalizing to Strongly Convex Functions

That result covers quadratic functions. Using the theory of orthogonal polynomials, we can extend the result to arbitrary-degree polynomials, to analytic functions, or really to any $$L^2$$ function. The Legendre polynomials are one such family of orthogonal polynomials, starting out as
$1, x, \frac{1}{2} \left( 3x^2 - 1 \right), \ldots$
This sequence can be generated in many different ways, but one simple way is to do Gram-Schmidt on the set $$\{1, x, x^2, x^3, \ldots\}$$ (with the $$L^2([-1,1])$$ inner product, with which we will be working for the foreseeable future).
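Carrying out this Gram-Schmidt computation explicitly (exact rational arithmetic; the helper names are ours) recovers the polynomials above:

```python
from fractions import Fraction as F

def ip(p, q):
    # L^2([-1, 1]) inner product of polynomials given as coefficient lists,
    # using int_{-1}^{1} x^k dx = 2/(k+1) for even k and 0 for odd k
    total = F(0)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            if (i + j) % 2 == 0:
                total += F(a) * F(b) * F(2, i + j + 1)
    return total

def gram_schmidt(vecs):
    basis = []
    for v in vecs:
        v = [F(c) for c in v]
        for u in basis:
            coef = ip(v, u) / ip(u, u)
            v = [a - coef * b for a, b in zip(v, u)]
        basis.append(v)
    return basis

# orthogonalize {1, x, x^2}, coefficient lists padded to a common length
p0, p1, p2 = gram_schmidt([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
# rescale so that p(1) = 1, the usual Legendre normalization
p2 = [c / sum(p2) for c in p2]   # gives (3x^2 - 1)/2
```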

The theory of orthogonal polynomials is a rich one but for the theorem above it suffices to work with the Legendre polynomials. Another common family is the Hermite polynomials, defined among other ways as (the normalizations of)
$H_n(x) = (-1)^n e^{x^2} \frac{d^n}{dx^n} e^{-x^2}$
or
$H_{n+1}(x) = 2x H_n(x) - H_n'(x)$
starting at $$H_0 = 1$$, or implicitly by
$e^{2xt – t^2} = \sum_{n=0}^\infty H_n(x) \frac{t^n}{n!}$
or as the function $$e^{x^2/2}$$ multiplied by the solutions to the eigenvalue problem
$\mathcal{F} f = \lambda f$
where $$\mathcal{F}$$ is the Fourier transform, or many more definitions. The Legendre polynomials satisfy similarly many interesting relations. The interested reader is encouraged to learn more about these families and their applications.
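As a quick check, the recurrence $$H_{n+1} = 2x H_n - H_n'$$ can be iterated symbolically on coefficient lists; it reproduces the standard $$H_2 = 4x^2 - 2$$ and $$H_3 = 8x^3 - 12x$$ (a sketch with our own helper names):

```python
def diff(p):
    # derivative of a polynomial given as a coefficient list (p[i] ~ x^i)
    return [i * c for i, c in enumerate(p)][1:]

def sub(p, q):
    n = max(len(p), len(q))
    p = p + [0] * (n - len(p))
    q = q + [0] * (n - len(q))
    return [a - b for a, b in zip(p, q)]

def hermite(n):
    # H_{n+1} = 2x H_n - H_n', starting from H_0 = 1
    h = [1]
    for _ in range(n):
        h = sub([0] + [2 * c for c in h], diff(h))
    return h
```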

Back to the topic at hand. Since the (normalized) Legendre polynomials $$h_i$$ form an orthonormal family and $$\{h_0, \ldots, h_n\}$$ spans exactly the space of polynomials of degree at most $$n$$, the minimum from before is the squared norm of the orthogonal projection onto the orthogonal complement of the affine functions: it kills exactly the terms in the expansion corresponding to $$h_0$$ and $$h_1$$, the constant and linear terms. More precisely, if $$f = \sum a_i h_i$$, then
$\min_{c, d} \int_{-1}^1 \left| f - cx - d \right|^2 = \sum_{i \geq 2} |a_i|^2 \geq |a_2|^2$
This gives the same curvature-type result as above, in an $$L^2$$ sense. However, we can relate this to more canonical properties of a function as follows. We say that a function $$f$$ is $$\lambda$$-strongly convex (resp. concave) on $$I$$ if $$f''(x) \geq \lambda$$ (resp. $$f''(x) \leq -\lambda$$) throughout $$I$$. Strong convexity easily implies that $$f'(t) \geq f'(0) + \lambda t$$ for $$t \geq 0$$, and $$f'(t) \leq f'(0) + \lambda t$$ for $$t \leq 0$$. For any $$f$$ we have
$a_2 = \langle f, h_2 \rangle \sim \int_{-1}^1 f(t) (3t^2 - 1) \, dt = - \int_{-1}^1 (t^3 - t) f'(t) \, dt$
and if $$f$$ is $$\lambda$$-strongly convex this is
$\int_0^1 (t - t^3) f'(t) \, dt + \int_{-1}^0 (t - t^3) f'(t) \, dt$
On $$[0,1]$$ we have $$t - t^3 \geq 0$$ and $$f'(t) \geq f'(0) + \lambda t$$, so the first integral is at least $$\int_0^1 (t - t^3)(f'(0) + \lambda t) \, dt = \frac{1}{4} f'(0) + \frac{2}{15} \lambda$$. The second integral is treated identically (there $$t - t^3 \leq 0$$ and $$f'(t) \leq f'(0) + \lambda t$$) and is at least $$-\frac{1}{4} f'(0) + \frac{2}{15} \lambda$$. The $$f'(0)$$ terms cancel, so $$a_2 \geq C \lambda$$ for a universal constant $$C$$. Therefore the $$L^2$$ bound gives
$\lVert f – g \rVert^2 \geq C \frac{\lambda^2}{n^4}$

What about higher dimensions? Let $$f \in \mathcal{C}^2([0,1]^d)$$. If there is a unit vector $$v$$ and a connected set $$U$$ such that $$\langle v, H_x(f) v \rangle \geq \lambda$$ for all $$x \in U$$, where $$H_x(f)$$ denotes the Hessian matrix of $$f$$ at $$x$$, then projecting along $$v$$ gives strong convexity, and integrating along these one-dimensional slices gives
$\lVert f – g \rVert \geq C \frac{\lambda^2}{n^4} \text{Vol}(U)$
There’s something a little unsatisfying about this condition, however, which is that in truly high dimensions we expect functions to be intrinsically more complex than in lower dimensions, and their approximation theory should scale accordingly. Neural networks excel at high-dimensional classification in practice, and this success is surely not limited to the type of one-dimensional phenomenon described above. However, let us keep in mind that while we think of the space of, say, $$1024 \times 1024$$ images as a million-dimensional vector space, it really is just a space of functions operating on two dimensions: the dimensionality of a sampling of a scene, which is exactly what a camera does, is a different notion than the dimensionality of the space on which the true scene, as a function, operates. This should be contrasted with (say) the case of DNA, where the “true” space actually does have nearly a hundred thousand dimensions.

# Other Depth Separation Results

The remainder of the lecture is an overview of a few results, with the ideas of proofs sketched or absent.

## Circuits

One potential model for approximability of functions (which is not exactly a neural network) is a logical circuit, with function complexity measured by circuit size and depth. The idea is that we have a bit string input (in $$\{0,1\}^n$$) as the “input layer”, and at each layer we can pass the bits from the previous layer through unary NOT or binary AND or OR gates; we call such circuits Boolean. The very strong result we want to compare to is:

(Rossman, Servedio, Tan) For any fixed $$d$$, there is a function $$f \colon \{0,1\}^n \to \{0,1\}$$, computed by a Boolean circuit of depth $$d$$ and size $$O(n)$$, such that any depth-$$(d-1)$$ circuit computing a function that agrees with $$f$$ on a fraction $$1/2 + \delta$$ of the possible inputs, for fixed $$\delta > 0$$, has size at least $$e^{n^{\Omega (1/d)}}$$.

This result follows a long line of work in theoretical computer science, in particular work by Håstad. We note, in comparison, that the separation results obtained by Telgarsky do not provide a depth separation between depths $$d$$ and $$d-1$$ like the result above.

## Eldan-Shamir

As a step in this direction we will discuss a result of Eldan and Shamir showing separation between depths 2 and 3 for pretty general activation functions. More precisely, suppose $$\sigma$$ satisfies
$|\sigma(x)| \leq C(1 + |x|^\alpha)$
for some $$C$$ and $$\alpha$$, and assume it is sufficiently expressive in the sense that for every Lipschitz $$f \colon \mathbb{R} \to \mathbb{R}$$ that is constant outside $$[-R, R]$$ and every $$\delta > 0$$, there exist $$a, \alpha_i, \beta_i, \gamma_i$$ such that
$\lVert f - a - \sum_i \alpha_i \sigma(\beta_i x - \gamma_i) \rVert_\infty < \delta$

(Eldan and Shamir) There exist universal constants $$c, C$$ with the following property. For every $$d$$ there exist a probability measure $$\mathbb{P}$$ on $$\mathbb{R}^d$$ and a function $$g \colon \mathbb{R}^d \to [-1,1]$$, supported in $$B_{C \sqrt{d}}(0)$$ and expressible by a $$3$$-layer network of width $$\sim d^5$$, such that for any $$f$$ expressible by a $$2$$-layer network of width at most $$c e^{c d}$$ we must have
$\mathbb{E}\left[|f – g|^2\right] \geq c$

Informally, there is a universal lower bound on the approximability of $$3$$-layer networks by $$2$$-layer networks.

[Sketch] In the end the function $$g$$ will be radial: $$g(x) = G(\lVert x \rVert^2)$$. Equivalently, its Fourier transform will be radial. On any bounded domain $$x_i^2$$ is Lipschitz, so we can approximate it in one network layer by the expressivity assumption; therefore we can represent $$\lVert x \rVert^2$$ in one layer as well (if we wanted to output this directly we would need another layer for the summation, but since we are using it as input to the next layer, the summation is built into the neuron inputs). The next layer will be used to compute $$G$$. Choose the measure $$\mu$$ (the $$\mathbb{P}$$ of the theorem) to have density $$|\phi|^2$$, where $$\phi(\xi) = \mathcal{F}(\chi_{B_1(0)})(\xi)$$. Clearly this is radial, and
$\mathbb{E}_\mu |f - g|^2 = \int |f - g|^2 |\phi|^2 \, dm = \lVert \widehat{f \phi} - \widehat{g \phi} \rVert_2^2$
Since $$f$$ (the function we’re comparing $$g$$ to) is two-layered, we have
$f(x) = \sum_i f_i(\langle v_i, x \rangle)$
where here $$f_i(x) = \sigma(x - \beta_i)$$, but actually there’s enough freedom in the proof to choose different activations at each neuron in each layer, provided they satisfy the assumed properties with uniform constants. By examining $$\widehat{f \phi}$$ as a convolution, $$\widehat{f \phi}$$ is supported in
$T = \bigcup_i \left(\operatorname{span} (v_i) + B_1(0) \right)$
The essence is to design $$g$$ such that exponentially many of these “tubes” are required to cover its support. Intuitively, we want $$\widehat{g \phi}$$ to have a lot of mass away from the origin. The construction uses a randomized $$\tilde{g} = \sum_i c_i \epsilon_i g_i$$, where $$g_i$$ is radial and essentially $$\chi\{\lVert x \rVert \in \Delta_i\}$$, $$\Delta_i$$ is basically an interval of width $$1/N$$ about $$1/d$$, and the $$\epsilon_i$$ are Rademacher. Then $$\tilde{g} \phi \sim \sum c_i \epsilon_i g_i$$, so $$\widehat{\tilde{g} \phi} \sim \sum \epsilon_i c_i \hat{g}_i$$. It’s possible to show without too much difficulty that each $$\hat{g}_i$$ has mass far away from the origin, but does $$\widehat{\tilde{g} \phi}$$ (with some nonzero probability)? What about cancellations? Let $$P$$ be the projection onto the part of the spectrum near the boundary of the ball. Then it turns out
$\mathbb{E}_\epsilon \left\lVert P \sum c_i \epsilon_i \hat{g}_i \right\rVert^2 = \sum_i \lVert P c_i \hat{g}_i \rVert^2$
which gives that there is indeed mass far from zero.
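The identity above is just the orthogonality of Rademacher signs: the cross terms vanish in expectation. A small numerical check, averaging over all sign patterns (with $$P$$ taken to be the identity for simplicity, and made-up vectors standing in for the $$c_i \hat{g}_i$$):

```python
from itertools import product

def sq_norm(v):
    return sum(x * x for x in v)

# made-up vectors standing in for the c_i * ghat_i
us = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]

total = 0.0
for eps in product([-1, 1], repeat=len(us)):
    s = [sum(e * u[j] for e, u in zip(eps, us)) for j in range(2)]
    total += sq_norm(s)
avg = total / 2 ** len(us)   # E_eps || sum_i eps_i u_i ||^2
```

The average equals $$\sum_i \lVert u_i \rVert^2$$ exactly, because each cross term $$\epsilon_i \epsilon_j \langle u_i, u_j \rangle$$ averages to zero.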

## Mathematics of Deep Learning: Lecture 1 – Introduction and the Universality of Depth 1 Nets

Transcribed by Joshua Pfeffer (edited by Asad Lodhia, Elchanan Mossel and Matthew Brennan)

# Introduction: A Non-Rigorous Review of Deep Learning

For most of today’s lecture, we present a non-rigorous review of deep learning; our treatment follows the recent book Deep Learning by Goodfellow, Bengio and Courville.

We begin with the model we study the most, the “quintessential deep learning model”: the deep feedforward network (Chapter 6 of GBC).

## Deep feedforward networks

When doing statistics, we begin with “nature”, a function $$f$$; the data is given by pairs $$(X_i, f(X_i))$$, where $$X_i$$ is typically high-dimensional and $$f(X_i)$$ is in $$\{0,1\}$$ or $$\mathbb{R}$$. The goal is to find a function $$f^*$$ that is close to $$f$$ using the given data, hopefully so that you can make accurate predictions.

In deep learning, which is by-and-large a subset of parametric statistics, we have a family of functions
$f(X;\theta)$
where $$X$$ is the input and $$\theta$$ the parameter (which is typically high-dimensional). The goal is to find a $$\theta^*$$ such that $$f(X;\theta^*)$$ is close to $$f$$.

In our context, $$\theta$$ is the network . The network is a composition of $$d$$ functions
$f^{(d)}( \cdot, \theta) \circ \cdots \circ f^{(1)}( \cdot, \theta),$
most of which will be high-dimensional. We can represent the network by a layered diagram in which $$h_1^{(i)},\ldots,h_n^{(i)}$$ are the components of the vector-valued function $$f^{(i)}$$ (also called the $$i$$-th layer of the network), and each $$h_j^{(i)}$$ is a function of $$(h_1^{(i-1)},\ldots,h_n^{(i-1)})$$. In such a diagram the number of components of each $$f^{(i)}$$ (which we call the width of layer $$i$$) is drawn the same, but in general, the width may vary from layer to layer. We call $$d$$ the depth of the network. Importantly, the $$d$$-th layer is different from the preceding layers; in particular, its width is $$1$$, i.e., $$f^{(d)}$$ (and hence $$f$$) is scalar-valued.

Now, what statisticians like the most are linear functions. But, if we were to stipulate that the functions $$f^{(i)}$$ in our network should be linear, then the composed function $$f$$ would be linear as well, eliminating the need for multiple layers altogether. Thus, we want the functions $$f^{(i)}$$ to be non-linear.
A common design is motivated by a model from neuroscience, in which a cell receives multiple signals and either doesn’t fire or fires with a certain intensity depending on the input signals. With the inputs represented by $$x_1,\ldots,x_n \in \mathbb{R}_{+}$$, the output in the model is given by
$f(x) = g\left( \sum a_i x_i + c \right)$
for some non-linear function $$g$$. Motivated by this example, we define
$h^{(i)} = g^{\otimes}\left( {W^{(i)}}^T h^{(i-1)} + b^{(i)} \right), \qquad h^{(0)} = x,$
where $$g^{\otimes}$$ denotes the coordinate-wise application of some non-linear function $$g$$.
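As a minimal sketch of this definition (plain Python, with ReLU as $$g$$; the weight layout and the example network are ours, for illustration only):

```python
def relu(z):
    return max(0.0, z)

def layer(W, b, h, g=relu):
    # g applied coordinatewise to W^T h + b, where W[j][k] is the weight
    # from input unit j to output unit k
    return [g(sum(W[j][k] * h[j] for j in range(len(W))) + b[k])
            for k in range(len(b))]

def tent(x):
    # a two-layer ReLU network computing relu(2*relu(x) - 4*relu(x - 0.5)),
    # a small piecewise-linear "tent" on [0, 1]
    h1 = layer([[1.0, 1.0]], [0.0, -0.5], [x])   # hidden layer of width 2
    return layer([[2.0], [-4.0]], [0.0], h1)[0]  # scalar output layer
```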

Which $$g$$ to choose? Generally, people want $$g$$ to be the “least non-linear” function possible: hence the common use of the ReLU (Rectified Linear Unit) function $$g(z) = \max(0,z)$$. Other choices of $$g$$ (motivated by neuroscience and statistics) include the logistic function $g(z) = \frac{1}{1 + e^{-2\beta z}}$ and the hyperbolic tangent $g(z) = \tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}.$ These functions have the advantage of being bounded (unlike the ReLU function).

As noted earlier, the top layer has a different form from the preceding ones. First, it is usually scalar-valued. Second, it usually has some statistical interpretation: $$h_1^{(d-1)},\ldots,h_n^{(d-1)}$$ are often viewed as parameters of a classical statistical model, informing our choice of $$g$$ in the top layer. One example of such a choice is the linear function $$y = W^T h + b$$, motivated by thinking of the output as the conditional mean of a Gaussian. Another example is $$\sigma(w^T h + b)$$, where $$\sigma$$ is the sigmoid function $$x \mapsto \frac{1}{1+e^{-x}}$$; this choice is motivated by thinking of the output as the probability of a Bernoulli trial with probability $$P(y)$$ proportional to $$\exp(yz)$$, where $$z = w^T h + b$$. More generally, the soft-max is given by
$\text{softmax}(z)_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}$
where $$z = W^T h +b$$. Here, the components of $$z$$ may correspond to the possible values of the output, with $$\text{softmax}(z)_i$$ the probability of value $$i$$. (We may consider, for example, a network with input a photograph, and interpret the output $$(\text{softmax}(z)_1,\text{softmax}(z)_2,\text{softmax}(z)_3)$$ as the probabilities that the photograph depicts a cat, dog, or frog.)
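A hedged sketch of the soft-max (the max-subtraction is a standard numerical-stability trick and does not change the value; the example scores are made up):

```python
import math

def softmax(z):
    # subtracting max(z) avoids overflow in exp; it cancels in the ratio
    mx = max(z)
    exps = [math.exp(v - mx) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

# e.g. made-up scores for (cat, dog, frog)
probs = softmax([2.0, 1.0, 0.1])
```

The outputs are positive and sum to one, so they can be read as class probabilities.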

In the next few weeks, we will address the questions: 1) How well can such functions approximate general functions? 2) What is the expressive power of depth and width?

## Convolution Networks

Convolution networks, described in Chapter 9 of GBC, are networks with restricted linear operators, namely localized convolution operators using some underlying grid geometry. For example, consider a network whose $$k$$-th layer is indexed by an $$m \times m$$ grid. We then define the function $$h^{(k+1)}_{i,j}$$ in layer $$k+1$$ by convolving over a $$2 \times 2$$ square in the layer below, and then applying the non-linear function $$g$$:
$h_{i,j}^{(k+1)} = g\left( a^{(k)} h_{i,j}^{(k)} + b^{(k)} h_{i+1,j}^{(k)} + c^{(k)} h_{i,j+1}^{(k)} + d^{(k)} h_{i+1,j+1}^{(k)} \right)$
The parameters $$a^{(k)}, b^{(k)}, c^{(k)},$$ and $$d^{(k)}$$ depend only on the layer, not on the particular square $$i,j$$. (This restriction is not necessary from the general definition, but is a reasonable restriction in applications such as vision.) In addition to the advantage of parameter sharing, this type of network has the useful feature of sparsity resulting from the local nature of the definition of the functions $$h$$.
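A minimal sketch of one such convolutional layer, on a grid represented as a list of lists (names and the $$2 \times 2$$ stencil layout are ours):

```python
def relu(z):
    return max(0.0, z)

def conv_layer(h, a, b, c, d, g=relu):
    # h is an m x m grid (list of lists); returns the (m-1) x (m-1) grid
    # h'[i][j] = g(a*h[i][j] + b*h[i+1][j] + c*h[i][j+1] + d*h[i+1][j+1]);
    # the four parameters are shared across all positions (i, j)
    m = len(h)
    return [[g(a * h[i][j] + b * h[i + 1][j]
               + c * h[i][j + 1] + d * h[i + 1][j + 1])
             for j in range(m - 1)] for i in range(m - 1)]

# averaging stencil applied to a constant 3 x 3 grid leaves it constant
out = conv_layer([[1.0] * 3 for _ in range(3)], 0.25, 0.25, 0.25, 0.25)
```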

A common additional ingredient in convolutional networks is pooling, in which, after convolving and applying $$g$$ to obtain the grid-indexed functions $$h_{i,j} ^{(k+1)}$$, we replace these function with the average or maximum of the functions in a neighborhood; for example, setting
$\overline{h}_{i,j}^{(k+1)} = \frac{1}{4} \left( h_{i,j}^{(k+1)} + h_{i+1,j}^{(k+1)} + h_{i,j+1}^{(k+1)} + h_{i+1,j+1}^{(k+1)} \right).$
This technique can also be used to reduce dimension.

## Models and Optimization

The next question we address is: how do we find the parameters for our networks, i.e., which $$\theta$$ do we take? Also, what criteria should we use to evaluate which $$\theta$$ are better than others? For this, we usually use statistical modeling. The net $$\theta$$ determines a probability distribution $$P_{\theta}$$; often, we will want to maximize $$P_{\theta}(y|x)$$. Equivalently, we will want to minimize the negative log-likelihood $J(\theta) = - \mathbb{E} \log P_{\theta}(y|x),$ where the expectation is taken over the data. For example, if we model $$y$$ as a Gaussian with mean $$f(x;\theta)$$ and covariance matrix the identity, then we want to minimize the mean squared error cost $J(\theta) = \mathbb{E}\left[ \left\| y - f(x;\theta) \right\|^2 \right].$ A second example to consider: $$y$$ sampled according to the Bernoulli distribution with probability exponential in $$w^T h + b$$, where $$h$$ is the last layer; in other words, $$P(y)$$ is logistic with parameter $$z = w^T h + b$$.

How do you optimize $$J$$ both accurately and efficiently? We will not discuss this issue in detail in this course since so far there is not much theory on this question (though such knowledge could land you a lucrative job in the tech sector). What makes this optimization hard is that 1) dimensions are high, 2) the data set is large, 3) $$J$$ is non-convex, and 4) there are too many parameters (overfitting). Faced with this task, a natural approach is to do what Newton would do: gradient descent. A slightly more efficient way to do gradient descent (for us, yielding an improvement by a factor of the size of the network) is back-propagation, a method that involves clever bookkeeping of partial derivatives by dynamic programming.
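To illustrate gradient descent on the simplest possible instance (a one-parameter model $$f(x;\theta) = \theta x$$ with squared loss; this toy example and its names are ours, not a recipe from the lecture):

```python
def grad_step(theta, xs, ys, lr):
    # one gradient step on J(theta) = mean_i (y_i - theta * x_i)^2
    g = sum(-2 * x * (y - theta * x) for x, y in zip(xs, ys)) / len(xs)
    return theta - lr * g

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]   # made-up data from y = 2x
theta = 0.0
for _ in range(200):
    theta = grad_step(theta, xs, ys, 0.05)
# theta converges to the true slope 2
```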

Another technique that we will not discuss (but would help you find remunerative employment in Silicon Valley) is regularization. Regularization addresses the problem of overfitting, an issue summed up well by the quote, due to John von Neumann, that “with four parameters I can fit an elephant and with five I can make him wiggle his trunk.” The characterization of five parameters as too many might be laughable today, but the problem of overfitting is as real today as ever! Convolution networks provide one solution to the problem of overfitting through parameter sharing. Regularization offers another solution: instead of optimizing $$J(\theta)$$, we optimize $\tilde{J}(\theta) = J(\theta) + \Omega(\theta),$ where $$\Omega$$ is a “measure of complexity”. Essentially, $$\Omega$$ introduces a penalty for “complicated” or “large” parameters. Some examples of $$\Omega$$ include the $$L_2$$ or $$L_1$$ norms (preferred to $$L_0$$ for convexity reasons). In the context of deep learning, there are other ways of addressing the problem of overfitting. One is data augmentation, in which the data is used to generate more data; for example, from a given photo, more photos can be generated by rotating the photo or adding shade (a rotated dog is still a dog!). Another is noising, in which noise is added either to the data (for example, by taking a photo and blacking out part of it) or to the parameters.
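As a toy illustration of an $$L_2$$ penalty, consider ridge regression on a one-parameter model through the origin; the penalty visibly shrinks the fitted parameter (a sketch with made-up data and our own names):

```python
def ridge_theta(xs, ys, lam):
    # minimize sum_i (y_i - theta * x_i)^2 + lam * theta^2;
    # the closed form is theta = sum(x*y) / (sum(x^2) + lam)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs, ys = [1.0, 2.0, 3.0], [1.1, 1.9, 3.2]
t0 = ridge_theta(xs, ys, 0.0)    # plain least squares
t1 = ridge_theta(xs, ys, 10.0)   # the penalty shrinks theta toward 0
```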

## Generative Models — Deep Boltzmann Machines

There are a number of probabilistic models that are used in deep learning. The first one we describe is an example of a graphical model . Graphical models are families of distributions that are parametrized by graphs, possibly with parameters on the edges. Since deep nets are graphs with parameters on the edges, it is natural to see whether we can express it as a graphical model. A Deep Boltzmann machine is a graphical model whose joint distribution is given by the exponential expression
$P(v,h^{(1)},\ldots,h^{(d)}) = \frac{1}{Z(\theta)} \exp\left( E(v,h^{(1)},\ldots,h^{(d)}) \right),$
where the energy $$E$$ of a configuration is given by
$E(v,h^{(1)},\ldots,h^{(d)}) = \sum_{i=1}^{d-1} {h^{(i)}}^T W^{(i+1)} h^{(i+1)} + v^T W^{(1)} h^{(1)}.$
Typically, internal layers are real-valued vectors; the top and bottom layers are either discrete or real-valued.

What does this look like as a graph, i.e., what is the graphical model here? It is a particular type of bipartite graph, in which the vertices corresponding to each layer are connected only to the layer immediately above it and to the layer immediately below it.

The Markov property says that, for instance, conditional on $$h^{(1)}$$, the distribution of a component of $$v$$ is independent of both $$h^{(2)},\ldots,h^{(d)}$$ and the other components of $$v$$. If $$v$$ is discrete,
$P[v_i=1| h^{(1)} ] = \sigma( W_{i,*}^{(1)} h^{(1)}),$
and similarly for other conditioning.
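A sketch of this conditional in plain Python (the $$2 \times 2$$ weight matrix and hidden configuration are made up for illustration):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def cond_prob_v(W1, h1):
    # P[v_i = 1 | h^(1)] = sigma(W^(1)_{i,*} h^(1)), one probability per
    # visible unit; W1[i] is the i-th row of the weight matrix
    return [sigmoid(sum(w * h for w, h in zip(row, h1))) for row in W1]

# made-up 2 x 2 weight matrix and hidden configuration
probs_v = cond_prob_v([[1.0, -2.0], [0.5, 0.5]], [1.0, 1.0])
```

Conditional on $$h^{(1)}$$ the components of $$v$$ are independent, so sampling $$v$$ amounts to independent coin flips with these probabilities.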

Unfortunately, we don’t know how to sample from or optimize in graphical models in general, which limits their usefulness in the deep learning context.

## Deep Belief Networks

Deep belief networks are computationally simpler, though a bit more annoying to define. These “hybrid” networks are essentially a directed graphical model with $$d$$ layers, except the top two layers are not directed: $$P(h^{(d-1)},h^{(d)})$$ is given by
\begin{equation}
P(h^{(d-1)},h^{(d)}) = Z^{-1} \exp\left( {b^{(d)}}^T h^{(d)} + {b^{(d-1)}}^T h^{(d-1)} + {h^{(d-1)}}^T W^{(d)} h^{(d)} \right)
\label{star}
\end{equation}
For the other layers,
\begin{equation}
P( h^{(k)} | h^{(k+1)}) = \sigma^{\otimes}\left(b^{(k)} + {W^{(k+1)}}^T h^{(k+1)}\right).
\label{starr}
\end{equation}
Note that we are going in the opposite direction that we were before. However, we have the following fact: if $$h^{(k)}, h^{(k+1)} \in \{0,1\}^n$$ are defined by (\ref{star}), then they satisfy (\ref{starr}).

Note that we know how to sample the bottom layers conditional on the layers directly above them; but for inference we also need the conditional distribution of the output given the input.

Finally, we emphasize that, while the $$k$$-th layer in the deep Boltzmann machine depends on layers $$k+1$$ and $$k-1$$, in a deep belief network, if we condition only on layer $$k+1$$, we can exactly generate the $$k$$-th layer (without conditioning on the other layers).

## Plan for the Course

In this class, the main topics we plan to discuss are

• the expressive power of depth,
• computational issues, and
• simple analyzable generative models.

The first topic addresses the descriptive power of networks: what functions can be approximated by networks? The papers we plan to discuss are

• “Approximations by superpositions of sigmoidal functions” by Cybenko (89).
• “Approximation capabilities of multilayer feedforward networks” by Hornik (91).
• “Representation Benefits of Deep Feedforward Networks” by Telgarsky (15).
• “Depth Separation in Relu Networks” by Safran and Shamir (16)
• “On the Expressive Power of Deep Learning: A Tensor Analysis” by Cohen, Sharir and Shashua (15).

The first two papers, which we will start to describe later in this lecture, prove that “you can express everything with a single layer” (If you were planning to drop this course, a good time to do so would be after we cover those two papers!). The subsequent papers, however, show that this single layer would have to be very wide, in a sense we will make precise later.

Regarding the second topic, hardness results we discuss in this course will likely include

• “On the computational efficiency of training Neural Networks” by Livni, Shalev-Shwartz and Shamir (14).
• “Complexity Theoretic Limitations on Learning DNF’s” by Daniely and Shalev-Shwartz (16).
• “Distribution Specific Hardness of learning Neural Networks” by Shamir (16).

On the algorithmic side:

• “Guaranteed Training of Neural Networks using Tensor Methods” by Janzamin, Sedghi and Anandkumar (16).
• “Train faster, generalize better” by Hardt, Recht and Singer.

Finally, papers on generative models that we will read will include

• “Provable Bounds for Learning Some Deep Representations” by Arora et al. (2014).
• “Deep Learning and Hierarchal Generative Models” by Mossel (2016).

Again, today we will begin to study the first two papers on the first topic—the papers by Cybenko and Hornik.

# The Theorems of Cybenko and Hornik

In his 1989 paper, Cybenko proved the following result.

[Cybenko (89)] Let $$\sigma$$ be a continuous monotone function with $$\lim_{t \rightarrow – \infty} \sigma(t) = 0$$ and $$\lim_{t \rightarrow + \infty} \sigma(t) = 1$$. (For example, we may let $$\sigma$$ be the sigmoid function $$\sigma(t) = \frac{1}{1 + e^{-t}}$$.) Then, the set of functions of the form $$f(x) = \sum \alpha_j \sigma(w_j^T x + b_j)$$ is dense in $$C_n([0,1])$$.

In the above theorem, $$C_n([0,1]) = C([0,1]^n)$$ is the space of continuous functions from $$[0,1]^n$$ to $$[0,1]$$ with the metric $$d(f,g) = \sup|f(x)-g(x)|$$.

Hornik proved the following generalization of Cybenko’s result.

[Hornik (91)]
Consider the set of functions defined in the previous theorem, but with no conditions placed yet on $$\sigma$$.

• If $$\sigma$$ is bounded and non-constant, then the set is dense in $$L^p(\mu)$$, where $$\mu$$ is any finite measure on $$\mathbb{R}^k$$.
• If $$\sigma$$ is additionally continuous, then the set is dense in $$C(X)$$, the space of all continuous functions on $$X$$, where $$X \subset \mathbb{R}^k$$ is compact.
• If, additionally, $$\sigma \in C^m(\mathbb{R})$$, then the set is dense in $$C^m(\mathbb{R}^k)$$ and also in $$C^{m,p}(\mu)$$ for every finite $$\mu$$ with compact support.
• If, additionally, $$\sigma$$ has bounded derivatives up to order $$m$$, then the set is dense in $$C^{m,p}(\mu)$$ for every finite measure $$\mu$$ on $$\mathbb{R}^k$$.

In the above Theorem, the space $$L^p(\mu)$$ is the space of functions $$f$$ with $$\int |f|^p d\mu < \infty$$ with metric $$d(f,g) = \left( \int |f-g|^p d\mu \right)^{1/p}$$. To begin the proof, we need a crash course in functional analysis.

[Hahn-Banach Extension Theorem]
If $$V$$ is a normed vector space with linear subspace $$U$$ and $$z \in V \backslash \overline{U}$$, then there exists a continuous linear functional $$L$$ on $$V$$ with $$L(x) = 0$$ for all $$x \in U$$, $$L(z) = 1$$, and $$\left\|L\right\| \leq 1/d(U,z)$$.

Why is this theorem useful in our context? Our proof of Cybenko and Hornik’s results is a proof by contradiction using the Hahn-Banach extension theorem. We consider the subspace $$U$$ given by $$\{ \sum \alpha_j \sigma(w_j^T x + b_j)\}$$, and we assume for contradiction that $$\overline{U}$$ is not the entire space of functions. We conclude that there exists a continuous linear map $$L$$ on our function space that restricts to $$0$$ on $$\overline{U}$$ but is not identically zero. In other words, to prove the desired result, it suffices to show that any continuous linear map $$L$$ that is zero on $$U$$ must be the zero map.

Now, classical results in functional analysis (the Riesz representation theorems) state that a continuous linear functional $$L$$ on $$L^p(\mu)$$ can be expressed as
$L(f) = \int f g \,d\mu$
for some $$g \in L^q(\mu)$$, where $$\frac{1}{p} + \frac{1}{q} = 1$$. A continuous linear functional $$L$$ on $$C(X)$$ can be expressed as
$L(f) = \int f \,d\mu(x),$
where $$\mu$$ is a finite signed measure supported on $$X$$.
We can find similar expressions for linear functionals on the other spaces considered in the theorems of Cybenko and Hornik.

Before going into the general proof, consider the (easy) example in which our function space is $$L^p(\mu)$$ and $$\sigma(x) = \mathbf{1}(x \geq 0)$$. How do we show that, if $$L(f) = 0$$ for all $$f$$ in the set defined in the theorem, then the function $$g \in L^q(\mu)$$ associated to $$L$$ must be identically zero? By translation, we obtain from $$\sigma$$ the indicator of any interval, i.e., we can show that $$\int_a^b g \, d\mu = 0$$ for any $$a < b$$. Since $$\mu$$ is finite ($$\sigma$$-finiteness is enough here), $$g$$ must be zero, as desired.

Using this example as inspiration, we now consider the general case in the setting of Cybenko’s theorem. We want to show that
$\int_{\mathbb{R}^k} \sigma(w^t x + b) \, d\mu(x) = 0 \quad \forall w, b$
implies that $$\mu = 0$$. First, we reduce to dimension $$1$$ using the following Fourier analysis trick: defining the measures $$\mu_a$$ by
$\mu_a(B) = \mu(x \colon a^t x \in B),$
we observe that
$\int_{\mathbb{R}} \sigma(w t + b) \, d\mu_a(t) = 0.$
Moreover, if we can show that $$\mu_a \equiv 0$$ for each $$a$$, then $$\mu \equiv 0$$ (“a measure is determined by all of its projections”), since then
$\hat{\mu}(a) = \int_{\mathbb{R}^k} \exp(i a^t x) \, d\mu(x) = \int_{\mathbb{R}} \exp(it) \, d\mu_a(t) = 0 \Rightarrow \hat{\mu} = 0 \Rightarrow \mu = 0.$
(Note that we used the finiteness of $$\mu$$ here.)

Having reduced to dimension $$1$$, we employ another very useful trick (that also uses the finiteness of $$\mu$$): the convolution trick. By convolving $$\mu$$ with a small Gaussian, we obtain a measure that has a density, letting us work with Lebesgue measure. We now sketch the remainder of the proof. By our convolution trick, we have
\begin{equation}
\int_{\mathbb{R}} \sigma(w t + b) h(t) \, dt = 0 \quad \forall w, b
\label{con}
\end{equation}
and want to show that the density $$h = 0$$.
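The easy one-dimensional intuition (steep sigmoids approximating indicators) can be checked numerically. The sketch below measures the average error of $$\sigma(k(x-a)) - \sigma(k(x-b))$$ against $$\chi_{[a,b]}$$ on a grid; the average (an $$L^1$$-type) error shrinks like $$1/k$$, although the sup-norm error does not (the target is discontinuous), which is one reason density is stated in $$L^p(\mu)$$ for such targets:

```python
import math

def sigmoid(t):
    return 1 / (1 + math.exp(-t))

def bump(x, a, b, k):
    # sigma(k(x - a)) - sigma(k(x - b)) approximates the indicator of [a, b]
    return sigmoid(k * (x - a)) - sigmoid(k * (x - b))

def mean_err(k, a=0.3, b=0.7, n=2001):
    # average |bump - indicator| over an evenly spaced grid on [0, 1]
    xs = [i / (n - 1) for i in range(n)]
    return sum(abs(bump(x, a, b, k) - (1.0 if a <= x <= b else 0.0))
               for x in xs) / n
```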
Changing variables, we rewrite the condition (\ref{con}) as
$\int_{\mathbb{R}} \sigma(t) h(wt+b) \, dt = 0 \quad \forall w \neq 0, b.$
To show that $$h=0$$, we use the following tool from abstract Fourier analysis. Let $$I$$ be the closure of the linear span of all the functions $$h(wt+b)$$. Since $$I$$ is invariant under shifts, it is invariant under convolution; in the language of abstract Fourier analysis, $$I$$ is an ideal with respect to convolution. Let $$Z(I)$$ denote the set of all $$\omega$$ at which the Fourier transforms of all functions in $$I$$ vanish; then $$Z(I)$$ is either $$\mathbb{R}$$ or $$\{0\}$$, since if $$g(t)$$ is in the ideal then so is $$g(tw)$$ for $$w \neq 0$$. If $$Z(I) = \mathbb{R}$$, then all the functions in the ideal are identically $$0$$, as needed. Otherwise $$Z(I) = \{0\}$$, and then, by Fourier analysis, $$I$$ is the set of all functions with $$\hat{f}(0) = 0$$, i.e., all functions with mean zero. But if $$\sigma$$ were orthogonal to all mean-zero functions, then $$\sigma$$ would be constant, which it is not. We conclude that $$Z(I) = \mathbb{R}$$, i.e., $$h = 0$$, completing the proof.
