Transcribed by Frederic Koehler (edited by Asad Lodhia, Elchanan Mossel and Matthew Brennan)
Today we will study a model interpolating between tree reconstruction (introduced last time) and deep learning (topic of this class).
Recall the model from last time: we have a broadcasting process on a tree \(T\) with root \(r\) and \(\ell\) levels, where \(k\) samples, each of the form \(\{\sigma'(v)\}_{v \in \text{Nodes}(T)}\), are generated by the following process:
We let \(\sigma(v)\) be the \(k\)-tuple of the \(k\) i.i.d. samples of \(\sigma'(v)\). Last time we discussed how you can reconstruct the structure of \(T\) given \(\{\sigma(v)\}_{v \in \text{Leaves}(T)}\).
We will now introduce two models that build on one another and interpolate between tree reconstruction and deep learning.
The first step in our interpolation is to convert the problem into one of semi-supervised learning. We attach to each vertex \(v\) in the graph a set of labels \(L(v)\), with the constraint that if \(v\) is above \(w\) then \(L(v) \subset L(w)\), i.e. we only accumulate labels as we go down the tree. Furthermore we require that if \(a \in L(v_1) \cap L(v_2)\) then the lowest common ancestor \(v\) of \(v_1\) and \(v_2\) satisfies \(a \in L(v)\).
The motivation for this comes from evolution: we should think of the labels as tags like “Dog” and “German Shepherd” (in which case the \(\sigma(v)\) at the leaves model the genetic data of each species), and the rules say that every descendant of a dog is a dog and that any two dogs’ most recent common ancestor was also a dog.
Now that we have labels we can consider the following semi-supervised learning task: the data is \(\{\sigma(v)\}_{v \in Level(\ell)}\) and \(\{(\sigma(v), L(v)) : v \in S\}\) for some subset \(S\) of the leaves (\(Level(\ell)\)). Our task is to infer \(L(v)\) for all \(v \in Level(\ell)\). Roughly speaking this models the problem that we have the genetic code for many animals, some of which are labeled by tags such as “Dog” and “Cat” etc. and now we want to infer for all of the unlabelled animals whether they are a dog or cat as well.
Now we move on to the second step: we want to introduce nonlinear functions of multiple variables (like the sigmoid composed with linear map in the neurons of a neural network). We modify the way \(\sigma(v)\) is generated given its parent \(\sigma(w)\):
Thus the variable \(\sigma_{i}(v)\) is a function of noisy copies of two different entries of \(\sigma(w)\). The ability to choose \(\Sigma\) gives us freedom in which of these two entries they are, for example \(\sigma_i\) could depend on one of positions \(i\) and \(i + 2^r\) in its parent and this roughly models the way convolutional neural networks mix across different length scales in each layer.
Also note that applying just one \(\Pi \in S_{q^2}\) is already interesting; i.e. how can we recover the root from the leaves in a known tree? For example if \(\Pi\) is drawn uniformly at random, and only at the last layer, and \(k = 1\) then by symmetry \(\sigma(r)\) is independent from the values of \(\sigma\) at the leaves which is markedly different from the situation without \(\Pi\).
Roughly the analogy to a standard neural network is as follows: \(\Sigma\) represents the connections of neurons, \(\Pi\) is the nonlinear activation composed with a linear function and the noise is a “nuisance variable” that makes objects different (instead of them just being copies of each other). For example when \(\Sigma = Id, \Pi = Id\) and there is no noise, the network is just copying. We can also visualize a sort of “deep net” of a leaf where we look at the interchanges of the variables (given by \(\Sigma\)) between \(\sigma\) on the path from the root to the leaf.
We now move on to discussing learning tasks.
Formally our goal is to solve the semi-supervised learning problem explained above. Practically, in order to solve this problem we will try to learn hidden variables \(\Sigma_{|v|}\), \(\Pi_{w,v}\) and \(\sigma(v)\) (for \(v\) non-leaves) as well as the tree structure, and then use all of this information to determine labels. We begin with an easy claim.
For example, if we only have one labelled example of a tiger there is no way to infer that some unlabeled examples are tigers, since there is no way to even figure out if the direct parent of the labeled example is also labeled as a tiger.
On the other hand, if the tree is known and we have a good labeling then it is obvious that we can infer all of the correct labels: first propagate the labels up the tree from \(S\) using the good labeling condition, then propagate this information down to the leaves (since labels are hereditary).
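A minimal sketch of this up-then-down propagation, assuming the good labeling condition (the tree encoding, function name, and example tree are illustrative, not from the lecture):

```python
def infer_leaf_labels(parent, labeled_leaves, leaves):
    """Infer the label set of every leaf from a labeled subset S of leaves.

    parent: dict mapping each node to its parent (root maps to None).
    labeled_leaves: dict leaf -> set of labels (the labeled subset S).
    Under the 'good labeling' condition, the vertices carrying label a form
    exactly the subtree under the LCA of the labeled examples of a, so we
    (1) propagate each label up to the LCA of its labeled carriers, then
    (2) push it back down to every leaf under that LCA.
    """
    def ancestors(v):                 # path from v up to the root, inclusive
        path = []
        while v is not None:
            path.append(v)
            v = parent[v]
        return path

    # Step 1: for each label, collect its labeled carriers and take their LCA.
    carriers = {}
    for leaf, labs in labeled_leaves.items():
        for a in labs:
            carriers.setdefault(a, []).append(leaf)
    lca = {}
    for a, vs in carriers.items():
        common = set(ancestors(vs[0]))
        for v in vs[1:]:
            common &= set(ancestors(v))
        lca[a] = max(common, key=lambda u: len(ancestors(u)))  # deepest one

    # Step 2: a leaf gets label a iff it lies under (or at) lca[a].
    return {leaf: {a for a, top in lca.items() if top in ancestors(leaf)}
            for leaf in leaves}
```

On a toy tree, two labeled dogs force their common parent to be a dog, which then labels every leaf below it; a single labeled cat stays confined to its own leaf, matching the claim above that one labeled example is not enough.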
What remains to discuss is the problem of recovering the tree. We will not attempt to use the labels for this task, so this part is an unsupervised learning task.
First we assume \(\Sigma\) is known. We consider the task of learning siblings.
Learning Siblings: First we consider how to test whether \(u,v\) are siblings with parent \(w\) given that \(\Pi_{w,u} = \Pi_{w,v} = Id\). Suppose this is the case, then if we let \[X := \# \{ i \in [k] : \sigma(v)_i = \sigma(u)_i \}\] we observe that there is a \(\theta^2\) chance of a given coordinate being “copied” from \(w\) to both \(u\) and \(v\), and otherwise a \(1/q\) chance of the coordinate being the same, thus \[\mathbb{E}[X] = k \theta^2 + \frac{k(1 - \theta^2)}{q}\] and by the Chernoff bound this is very concentrated (sub-Gaussian with variance proxy of order \(k\)). Moreover if \(u\) and \(v\) are not siblings then the expectation is smaller, i.e. the first term is at most \(k \theta^4\), so for sufficiently large \(k\) we can distinguish these two cases with high probability.
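A quick Monte Carlo check of this separation (a sketch: the two leaves are generated directly from their common ancestor rather than on a full tree, and the values of \(k, q, \theta\) are made up):

```python
import random

def coincidences(k, q, theta, half_dist, rng):
    """X = number of coordinates on which two leaves agree, when each leaf is
    reached from the common ancestor by a path of `half_dist` edges (each
    edge copies w.p. theta, else resamples uniformly from [q])."""
    x = 0
    for _ in range(k):
        anc = rng.randrange(q)
        vals = []
        for _ in range(2):
            s = anc
            for _ in range(half_dist):
                if rng.random() >= theta:   # edge fails to copy: fresh symbol
                    s = rng.randrange(q)
            vals.append(s)
        x += vals[0] == vals[1]
    return x

rng = random.Random(0)
k, q, theta = 20000, 4, 0.8
x_sib = coincidences(k, q, theta, 1, rng)   # siblings: graph distance 2
x_far = coincidences(k, q, theta, 2, rng)   # non-siblings: distance >= 4
```

Here \(x_{sib}/k\) concentrates around \(\theta^2 + (1-\theta^2)/q\) and \(x_{far}/k\) around the strictly smaller \(\theta^4 + (1-\theta^4)/q\), so thresholding \(X\) separates the two cases.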
In general, we can do this test by enumerating over all possible \(\Pi_{w,u}\) and \(\Pi_{w,v}\): if \(u\) and \(v\) are not siblings, then with high probability we will never observe a large \(X\), whereas if they are then when we choose the right permutations we will observe a large \(X\) with high probability.
Recursive Algorithm: Now that we can learn siblings, we can recurse similarly to the case of tree reconstruction. Thus we can recover the tree, and under the condition of a good labeling we can perform semi-supervised learning with \(k = \Theta(\ell)\).
We end by noting that if \(S\) is very large then there are other ways to learn labels without reconstructing the tree; e.g. if \(S\) consists of all but a few labels, then we can learn labels of unlabeled leaves just by finding their nearest-neighbor among the elements of \(S\).
Note that in the above result we used a recursive (“deep”) algorithm. This leads us to this natural question: can we show that simpler algorithms don’t perform as well?
First we will show that local algorithms require a suboptimal number of samples. The definition of a local algorithm is that when it labels \(w \notin S\), it uses only \(\{(\sigma(v), L(v)) : v \in S\}\) and \(\sigma(w)\); it does not use any of the other unlabelled data.
We choose the following input distribution over labels: Fix an integer \(h\) and \(h_0 < h\). There are no labels for the first \(h_0 - 1\) levels. At level \(h_0\), we label each vertex distinctly, in a random order. For each vertex \(v\) at level \(h_0\), we pick its leftmost and rightmost descendants (i.e. the leftmost and rightmost leaves of the subtree under \(v\)) and add these two vertices to \(S\).
Now the proof “trick” is to give the algorithm extra information, \((Label(v), \sigma(v))\) for each vertex \(v\) at level \(h_0\). Given this extra information, \(\sigma(w)\) is independent of \(\{\sigma(v) : v \in S\}\) by the Markov property (since removing the vertices at level \(h_0\) leaves \(w\) in a connected component by itself, relying on the “cheat” described above). We also make the algorithm’s life “easy” by setting \(\Pi = Id\) and \(\Sigma = Id\).
Now we have reduced the problem to a pure hypothesis testing question. Let \(H_i\) be the hypothesis that \(w\) is labelled by \(i\), i.e. \(w\) is a descendant of \(v_i\) where \(v_i\) is the vertex at level \(h_0\) labeled by \(i\). Note that the distribution of \(\sigma(w)\) given the true \(v_i\) is that of a noisy copy with parameter \(\theta^{h - h_0}\). Thus with probability at least \(1 - k \theta^{h - h_0}\), all of the coordinates of \(\sigma(w)\) have been resampled from the uniform distribution. By a coupling argument, it follows that any hypothesis test labels correctly with probability at most \[(1 - k\theta^{h - h_0}) (1/d) + k \theta^{h - h_0} \le (1/d) + k\theta^{h - h_0}.\]
Next we want to show that certain shallow algorithms also cannot recover as well as the “deep” algorithm we presented first. Formally, we define a shallow algorithm to be one that is given access only to the following data:
In other words, we have summarized the labeled data by count statistics and do not let the algorithm look at higher order moments of the labelled data.
We consider the family of input instances where the direct children of the root are labeled \(1, \ldots, d\) and, in each of their subtrees, every leaf is labeled except the rightmost one. Then we have the following theorem:
We do not present the whole proof of the theorem but just observe that it follows from known lower bounds on count reconstruction.
Finally we present an open problem: The lower bounds should be much better when the \(\Pi\) are random and the \(\Sigma_j\) mix different elements. More specifically, the bound for shallow algorithms should hold if \(\theta = 1 - c^h\) for some \(c < 1\), instead of requiring \(\theta < 1/\sqrt{d}\), when the \(\Pi\) are random and \(\Sigma\) “mixes” well (e.g. \(\Sigma_r(x) = x + 2^r\)).
Transcribed by Paxton Turner (edited by Asad Lodhia, Elchanan Mossel and Matthew Brennan)
We prove Claim 2 which was left over from last time.
We begin with an \(\ell\)-level full binary tree and take the alphabet size to be \(q = \infty\). We generate \(k\) independent samples of the broadcasting procedure with parameter \(\lambda\) on the tree.
In contrast to the last lecture, today we do not know the tree structure, i.e. we only receive the colors of the nodes and do not know which nodes are siblings, cousins, etc. Our goal is to determine how large \(k\) must be to recover the tree structure.
In practice, these trees are phylogenetic trees, and the information we have are the DNA/amino acid sequences. Thus, usually \(q = 4\), or \(q = 20\).
Question : How large should \(k, \ell, \lambda\) be so that we recover the tree correctly as \(\ell \to \infty\)?
If \(u,v\) are graph distance \(2r\) apart, then
\[
\mathbb{P}[\sigma(u) = \sigma(v)] = \lambda^{2r}.
\]
How large should \(k\) be so that all distances can be computed accurately? Trivially, if \(k \lambda^{2\ell - 2} = 1/2\) and \(d(u,v) = 2 \ell - 2\), then taking a union bound gives
\[
\mathbb{P}[\exists \, \text{sample with} \, \sigma(u) = \sigma(v)] \leq 1/2.
\]
Then we can’t say if \(d(u,v) = 2 \ell\) or \(d(u,v) = 2 \ell - 2\). If we use concentration, then
\[
k \geq \log(2^\ell) \left(\frac{1}{\lambda^\ell}\right)^2 \log(1/\delta)
\]
allows us to compute distances correctly with probability \(\geq 1 - \delta\). One can show this with Chernoff bounds, and the details are omitted here. If we compute all of the distances between the leaves correctly, then we trivially can recover the tree structure. However, as noted above, this requires at least \(k \geq c (1/\lambda)^{2 \ell} = |T|^\beta\) samples.
Computing all of the distances between the leaves seems like too strong of a requirement. This raises the question: Is there a deep algorithm that tries to infer the labels at the internal nodes sample by sample that can recover the tree with fewer samples? We will outline such an algorithm when \( 2 \lambda > 1\).
After detecting siblings, we want to understand the probability of being able to recover the colors of their parents sample by sample. Consider first a small example. Suppose we have a 2-level binary tree with leaves labeled. If there is a 1 in both the left and right branches, then with probability 1 the root is colored 1. However, if no color appears in both branches, we mark the node with a question mark \(?\). Returning to our original problem, we can first apply this procedure to identify siblings among the leaves and the labels of their parents sample by sample, then apply it to the nodes at level \(\ell - 1\), and so on.
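This \(q = \infty\) procedure is easy to simulate, modeling fresh colors by a global counter so that every non-copied symbol is brand new (a sketch; the function names and parameter values are made up):

```python
import itertools
import random

def sample_leaves(levels, lam, rng, counter):
    """One broadcast sample on the full binary tree with q = infinity:
    each edge copies its parent's symbol w.p. lam, else draws a brand-new
    symbol from the global counter.  Returns (root symbol, leaf symbols)."""
    def down(sym, depth):
        if depth == 0:
            return [sym]
        out = []
        for _ in range(2):
            child = sym if rng.random() < lam else next(counter)
            out += down(child, depth - 1)
        return out
    root = next(counter)
    return root, down(root, levels)

def infer_root(leaves):
    """Any symbol appearing in both halves must have passed through the root
    (fresh symbols are globally unique); if none does, return None ('?')."""
    half = len(leaves) // 2
    shared = set(leaves[:half]) & set(leaves[half:])
    return shared.pop() if shared else None
```

Whenever `infer_root` answers, it is correct with probability 1, exactly as in the two-level example above; the only failure mode is the question mark.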
How many samples are needed to identify all siblings correctly?
\[
k= \frac{(1 - \lambda)^4}{\lambda} \log \left( \frac{2^\ell}{\delta}\right)
\]
suffices for error \(\leq \delta\). It turns out that there is a similar bound for recovering the tree if \(2 \lambda > 1\). As observed above, if we purely used distance methods, then we require \(k \geq c (1/\lambda)^{2 \ell} = |T|^\beta\). However, if \(2 \lambda > 1\), then our “deep” algorithm requires only \(k = \log(|T|)\) samples.
If \(2 \lambda < 1\), then what is true? The goal is to show that the data generated by different trees cannot be distinguished. Heuristically, we must estimate the probability of getting all question marks \(?\) for the lower bound.
Lower bound proof idea: Start with four trees having roots \(a, b, c,\) and \(d\). We want to couple the broadcasting procedure on these trees so that they produce the same data. In such a coupling, we can show
\[
\mathbb{P}[\text{generating different samples}] \leq 4(2 \lambda)^\ell.
\]
Thus,
\[
\mathbb{P}[\text{coupling succeeds}] \geq 1 - 4k(2 \lambda)^\ell.
\]
Thus if we want to recover the tree correctly with probability \(\geq 1 - \delta\), we need \(k \geq c(2 \lambda)^{-\ell} = \Omega(|T|^{\beta})\) samples (if \(2\lambda < 1\)).
We make the following general conjecture.
Conjecture : Recursive methods should give optimal complexity.
Next we consider a new model. Now we have a full binary tree, and on all edges, we copy with probability \(\eta\). Our alphabet now has size \(q = 2\), and in particular, we set the “letters” to be \(\{-1, 1\}\).
Using distance methods, we cannot distinguish between
\[
\mathbb{E}[\sigma(u) \sigma(v)] = \eta^{2 \ell}, \quad \mathbb{E}[\sigma(v') \sigma(v)] = \eta^{2 \ell - 2}
\]
The lower bound for distance methods here is \(k \geq c/\eta^{2 \ell}\), and this is more or less tight. As before, we will instead use a recursive approach: first we identify siblings, and then estimate colors of the parents. To identify siblings, use the correlations
\[
\mathbb{E}[\sigma(v) \sigma(v')] = \eta^2
\]
if \(v\) and \(v'\) are siblings. We use correlations to recover distances up to a constant number of levels up the tree, and this requires \(\approx \log(2^\ell)\) samples to concentrate. Then we can use majority vote to recover the color of the parent. By a similar computation to the 2nd moment method from last lecture, if \(2 \eta^2 > 1\),
\[
\mathbb{P}[\text{Maj}(x_1, \ldots, x_{2^r}) = x_{\text{root}}] \geq \frac{1}{2} + \epsilon
\]
for all \(r\). Thus we define
\[
\hat{\sigma}_i(v) = \text{Maj}\left(\sigma_i(u): u \, \text{leaf below} \, v \right).
\]
Note that
\[
\mathbb{E}[\sigma_{\text{root}} \hat{\sigma}_{\text{root}}] = \rho > 0 \Rightarrow \mathbb{E}[\hat{\sigma}_i(v) \hat{\sigma}_i(w)] = \eta^{2d(v,w)} \rho^2.
\]
Also note that \(\rho = \Theta(\epsilon)\).
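A small simulation of the key recursive ingredient, plain majority over the leaves (a sketch; the values of \(\eta\) and the depth are made up, chosen so that \(2\eta^2 > 1\)):

```python
import random

def broadcast_pm1(levels, eta, rng):
    """One +/-1 broadcast sample on the full binary tree: each edge copies
    its parent w.p. eta, else resamples uniformly from {-1, +1}.
    Returns (root symbol, list of 2**levels leaf symbols)."""
    layer = [rng.choice((-1, 1))]
    root = layer[0]
    for _ in range(levels):
        layer = [s if rng.random() < eta else rng.choice((-1, 1))
                 for s in layer for _ in range(2)]
    return root, layer

def majority(xs, rng):
    """Majority vote over +/-1 values, breaking ties uniformly at random."""
    t = sum(xs)
    return rng.choice((-1, 1)) if t == 0 else (1 if t > 0 else -1)

rng = random.Random(2)
eta, levels, trials = 0.9, 6, 2000      # 2 * eta**2 = 1.62 > 1
agree = 0
for _ in range(trials):
    root, leaves = broadcast_pm1(levels, eta, rng)
    agree += majority(leaves, rng) == root
```

At these parameters the empirical agreement rate between \(\text{Maj}\) and the root sits well above \(1/2\), consistent with the \(\frac{1}{2} + \epsilon\) bound above.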
Results If \(2 \eta^2 > 1\), we can recover the tree with \(k = \log(|T|)\) samples. Otherwise, \(k \geq |T|^\beta\). Compare this to the thresholds \(d \eta^2 > 1\) for count reconstruction and \(d \lambda_2^2 > 1\) for general broadcasting.
We define a new twist on the previous model. The parameter \(\eta\) is now non-uniform, but we still require it to satisfy \(2 \eta^2(e) \geq 1 + \epsilon\) for all edges \(e\). In this general set-up, one has to be more careful with detecting distances/siblings.
Four point method: This is a trick from phylogeny. For the purposes of detecting distances, there are 3 possible 2-level binary trees. The sibling pairs are either \(\{(A, B), (C,D)\}\), \(\{(A, C), (B, D) \}\), or \(\{ (A, D), (B, C) \}\). We use pairwise distances to determine which tree it is. Set
\[
Dist(u,v) = \log\left( \frac{1}{\mathbb{E}[\sigma_u \sigma_v]}\right).
\]
Now to determine siblings of \(v\), find all vertices with
\[
Dist(u,v) \leq 3 \log \sqrt{2},
\]
and apply the four point method on these vertices.
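The four point method itself is short in code (a sketch; the distances below are an idealized additive tree metric rather than estimates from data):

```python
def four_point_split(dist):
    """Given pairwise distances among four taxa A, B, C, D (keyed by
    frozenset pairs), return the pairing whose within-pair distances have
    the smallest total: in an additive tree metric, the sum over the true
    sibling pairs is strictly smaller than over the other two pairings."""
    pairings = [(("A", "B"), ("C", "D")),
                (("A", "C"), ("B", "D")),
                (("A", "D"), ("B", "C"))]
    cost = lambda p: sum(dist[frozenset(pair)] for pair in p)
    return min(pairings, key=cost)
```

For a quartet tree with siblings \((A,B)\) and \((C,D)\) and unit edge lengths, within-pair distances sum to \(4\) for the true split versus \(8\) for either alternative, so the minimum identifies the topology.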
One can also run into an issue when estimating the colors of parents. To solve this, use an \(\ell\) level recursive estimator.
In the next lecture, we discuss hierarchical generative models.
Transcribed by Govind Ramnarayan (edited by Mathew Brennan, Asad Lodhia and Elchanan Mossel)
We start by describing some very simple hierarchical models on the \(d\)-ary tree. Recall that in the \(d\)-ary tree, each node has exactly \(d\) children (as opposed to the \(d\)-regular tree, in which the root has \(d\) children and every other node has \(d-1\) children). Both models will feature an unknown character at the root, chosen according to some distribution. Each node propagates its character to its children with some Gaussian noise. The goal will be to recover the character at the root given the characters at some level \(\ell\).
We will denote the character at the vertex \(v\) by \(X_v\), with the root being denoted as 0.
\(X_0\) will be distributed according to an unknown distribution \(\mu\), with
\[
X_v = X_0 + \sum_{w \in \text{path}(0,v) \backslash \{0\}} \sigma \mathcal{N}_w
\]
where \(\mathcal{N}_w\) are i.i.d. \(N(0,1)\) random variables, and \(\sigma\) is a fixed parameter.
The goal is to estimate \(X_0\) given \(\left\{ X_{v}: v \in L_{\ell}\right\}\), the set of labels at level \(\ell\). However, this model will not be the one we focus on today.
This is similar to the above model, but instead \(X_0 \sim N(0,1)\) and if \(v\) is a parent of \(w\) then
\[
X_w = \eta X_v + \sqrt{1 - \eta^2} \mathcal{N}_w
\]
where \(\mathcal{N}_w\) are i.i.d \(N(0,1)\).
An important observation about this model is that \(X_w \sim N(0,1)\) for all nodes \(w\).
Just like in the previous model, we want to estimate \(X_0\) given \(\left\{X_v: v \in L_{\ell}\right\}\). The question is, how should we do it? A natural attempt is to take the average of the children, which we now analyze.
Note that \((X_0, X_v: v \in L_\ell)\) is jointly Gaussian. This means that the estimate that minimizes the \(\ell_2\) error
\[
\mathbb{E}\left[X_0 \Big|(X_v)_{v \in L_{\ell}}\right]
\]
can be written as \(\sum_{v \in L_{\ell}} a_v X_v\), by the definition of being jointly Gaussian. Also, by symmetry of the model, all the \(a_v\)’s are the same, so we can write this as \(a \sum_{v \in L_{\ell}} X_v\).
So to find the correct estimate for \(X_0\), it suffices to find the value of \(a\) that minimizes
\[
\mathbb{E}\left[\left(X_0 - a \sum X_v\right)^2\right]
\]
which we now proceed to do. We begin by computing the mean squared error
\begin{align*}
\mathbb{E}\left[\left(X_0 - a \sum X_v\right)^2\right] &= \mathbb{E}[X_0^2] + a^2\sum \mathbb{E}[X_v^2] - 2ad^{\ell}\text{Cov}(X_0, X_v) + a^2 \sum_{v \neq w \in L_{\ell}} \text{Cov}(X_v, X_w) \\
&= 1 + d^{\ell}a^2 - 2ad^{\ell}\eta^{\ell} + a^2 d^{\ell} \sum\limits_{j=1}^{\ell} \eta^{2j} (d-1)d^{j-1} \\
&= 1 + d^{\ell}\left(-2a\eta^{\ell} + \frac{a^2(d-1)}{d} \cdot
\frac{(d \eta^2)^{\ell + 1} - 1}{(d\eta^2) - 1} \right)
\end{align*}
The second line follows from expanding the square, the fact that there are \(d^{\ell}\) nodes at level \(\ell\), and the fact that the covariance of \(X_v\) and \(X_w\) depends only on the distance between them in the underlying tree: if the distance is \(D\), then \(\mathbb{E}[X_v X_w] = \eta^D\). Since all the vertices in the final sum are at level \(\ell\), they are at an even distance on the tree (the distance is \(2j\), where \(j\) denotes the distance between the vertices and their lowest common ancestor), and for each \(v\) there are \((d-1)d^{j-1}\) such vertices \(w\). The third line follows from the formula for the sum of a geometric series (up to replacing the additive constant \(1\) by \(\frac{d-1}{d}\), which does not affect the analysis).
We can compute the value of \(a\) that minimizes
\begin{equation}
1 + d^{\ell} \left(-2a\eta^{\ell} + \frac{a^2(d-1)}{d} \cdot \frac{(d \eta^2)^{\ell + 1} - 1}{(d\eta^2) - 1} \right)
\end{equation}
by taking the derivative with respect to \(a\) (this can also be accomplished approximately by setting the two terms in the sum to be equal and solving for \(a\)). Doing so, we get that
\begin{equation}
a = \eta^{\ell} \left(\frac{d-1}{d} \cdot \frac{(d\eta^2)^{\ell}}{d\eta^2 - 1} \right)^{-1}
\end{equation}
Now we realize that the optimal squared error \( \mathbb{E}\left[\left(X_0 – a \sum X_v\right)^2\right] \) at this value of \( a \), as given by the computation above, behaves very differently when \(d \eta^2 < 1\) versus when \(d \eta^2 > 1\). When \(d \eta^2 < 1\), we notice that \(\frac{(d \eta^2)^{\ell + 1} - 1}{(d\eta^2) - 1}\) is approximately a constant. Furthermore, the optimal value of \(a\) is roughly \(\eta^{\ell}\), and so our expression for the optimal squared error becomes
\[
\mathbb{E}\left[\left(X_0 - a \sum X_v\right)^2\right] \approx 1 + d^{\ell}a(-2\eta^{\ell} + a \Theta(1)) \approx 1 - \Theta\left((d\eta^2)^{\ell}\right) \approx 1
\]
Note also that a squared error of 1 would be accomplished by simply ignoring the leaves and estimating \(X_0\) by 0.
By contrast, when \(d\eta^2 > 1\), we argue that the optimal squared error is less than 1. Plugging in the minimizing value of \(a\) gives that
\begin{align*}
\mathbb{E}\left[\left(X_0 - a \sum X_v\right)^2\right] &= 1 + d^{\ell} \left(-2a\eta^{\ell} + \frac{a^2(d-1)}{d} \cdot \frac{(d \eta^2)^{\ell + 1} - 1}{(d\eta^2) - 1} \right) \\
&= 1 + d^{\ell}(-2a\eta^{\ell} + a\eta^{\ell}) \\
&= 1 - d^{\ell}\eta^{\ell} \left( \eta^{\ell} \frac{d}{d-1} \frac{d\eta^2 - 1}{(d\eta^2)^{\ell + 1} - 1} \right) \\
&\leq 1 - \frac{(d\eta^2)^{\ell}}{(d\eta^2)^{\ell + 1}} \cdot \frac{d}{d-1} \cdot (d\eta^2 - 1) \\
&< 1
\end{align*}
In summary, we have a threshold at \(d\eta^2 = 1\) for this simple model. If \(d\eta^2 > 1\), we have shown that we can estimate better than random with the optimal linear estimator. When \(d\eta^2 < 1\), we know that we cannot do better than random. But this model was a linear Gaussian model, which is not at all similar to the type of structure posed by deep learning (specifically, we know that deep learning has non-linearities that are applied at each layer, whereas the entire model here was linear). We’ll introduce some non-linearities in the next model to try to make it more interesting, and analyze the resulting behaviour.
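As a numerical sanity check on this threshold, the exact optimal MSE (no approximations, using the covariance structure from the derivation above) can be computed directly; the parameter choices below are arbitrary:

```python
def optimal_mse(d, eta, levels):
    """Exact minimum over a of E[(X_0 - a * sum_v X_v)^2] for Model 2.
    Uses E[X_0 X_v] = eta**levels, E[X_v^2] = 1, and
    Cov(X_v, X_w) = eta**(2j) when the leaves' LCA is j levels above them
    (there are (d-1) * d**(j-1) such w per v).  With
    cross = E[X_0 * sum X_v] and V = E[(sum X_v)^2], the minimizer is
    a* = cross / V, giving MSE = 1 - cross**2 / V."""
    n = d ** levels
    V = n + n * sum(eta ** (2 * j) * (d - 1) * d ** (j - 1)
                    for j in range(1, levels + 1))
    cross = n * eta ** levels
    return 1 - cross * cross / V

above = optimal_mse(2, 0.9, 12)   # 2 * 0.9**2 = 1.62 > 1: beats random
below = optimal_mse(2, 0.5, 12)   # 2 * 0.5**2 = 0.50 < 1: MSE close to 1
```

With \(d\eta^2 > 1\) the error stays bounded away from 1 as \(\ell\) grows, while with \(d\eta^2 < 1\) it creeps up toward the trivial error 1 obtained by always guessing 0.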
In this model we deal with the same \(d\)-ary tree as before, but the initial distribution of \(X_0\) and the noise model we use will change.
Specifically, \(X_0\) will be distributed uniformly in \([q] = \{ 1,\ldots, q\}\) and if \(w\) is the parent of \(v\), then
\[
X_v = \begin{cases}
X_w & \text{ w.p. $\eta$} \\
\text{Unif}[q] & \text{ w.p. } 1 - \eta
\end{cases}
\]
Our goal is as follows.
Goal: Estimate \(X_0\) from \(\left(X_v: v \in L_{\ell}\right)\)
Before continuing, we note that the error of any estimator is always at least \(\Theta(1)\) (where we are thinking of \(d\) and \(\eta\) as constants), since there is a chance that there is no copying at the first level! But this was also a problem for Model 2.
A First Attempt (Plurality): One naive estimator is to take the value \(a \in [q]\) that is most common among the \(\left(X_v : v \in L_{\ell}\right)\) (i.e., taking the plurality vote amongst the leaves at level \(\ell\)). This is one of the most natural attempts, but is this the optimal estimator? We now analyse (a variant of) this estimator.
We’ll analyse a variant of the plurality estimator. Let
\[
y_v =
\begin{cases}
1 - 1/q & \text{ if } X_v = 1 \\
-1/q & \text{ if } X_v \neq 1
\end{cases}
\]
The estimator we will use is the following: compute \(S = \sum_{v \in L_{\ell}} y_v\) and declare the root to be \(1\) if \(S\) is large.
We won’t formally prove this claim now; it will follow from the results in the rest of the lecture. However, intuitively \(S\) synchronizes with the plurality estimator when the root symbol is 1, and is random otherwise. If plurality does well, then by symmetry it should do well when the root symbol is 1. It is clear that in this case, \(S\) succeeds whenever plurality succeeds, so it outperforms the random estimator. If \(S\) works correctly, it means that 1 appeared in more than a \(\frac{1}{q}\) fraction of the leaves very frequently. The only way plurality could work poorly in this case is if the plurality element is not 1 on many of these occasions. However, by symmetry, every non-root element has lower probability of showing up at the leaves than the root element, so this will only happen if, for many of those times, 1 showed up in just a little more than a \(\frac{1}{q}\) fraction of the leaves but was beaten by a non-root symbol due to random deviation. However, in that case 1 would not have showed up above the \(\frac{1}{q}\) fraction so often, because random deviation would often push it below \(\frac{1}{q}\). We will indeed see that this is the case in the remainder of the lecture.
First we note a trivial claim about when we cannot hope to recover \(X_0\) better than random asymptotically.
How do we analyse this estimator? We’ll use the following three facts:
\[
\mathbb{E}\left[S | X_0 = 1\right] = \mathbb{E}\left[\#\{v \text{ connected to the root}\}\right] \cdot \left(1 - \frac{1}{q}\right) = (d \eta)^{\ell} \left(1 - \frac{1}{q}\right)
\]
\[
\mathbb{E}\left[S | X_0 \sim U\right] = 0
\]
which follows from the fact that each \(y_v\) is an unbiased random variable. Here \( U \) denotes the uniform distribution on \([q]\). Also,
\[
\text{Var}(S) = \Theta(d^{\ell}) \text{ if } d\eta^2 < 1, \qquad \text{Var}(S) = \Theta\left((d\eta)^{2\ell}\right) \text{ if } d\eta^2 > 1,
\]
which follows from a direct computation very similar to the calculation in the previous model, where we minimized the \(\ell_2\)-error.
From the above three observations, we can conclude that there is some \(\epsilon > 0\) such that
\[
d \eta^2 > 1 \Rightarrow \frac{\mathbb{E}\left[S | X_0 = 1\right] - \mathbb{E}\left[S | X_0 \sim U\right]}{\sqrt{\text{Var}(S)}} > \epsilon
\]
and furthermore that
\[ d\eta^2 < 1 \Rightarrow \frac{\mathbb{E}\left[S | X_0 = 1\right] - \mathbb{E}\left[S | X_0 \sim U\right]}{\sqrt{\text{Var}(S)}} \to 0
\]
As we will see, in the case where \(d \eta^2 > 1\), we will be able to recover the root better than random by using the estimator \(S\).
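These facts are easy to check by simulation (a sketch; color \(0\) plays the role of the symbol \(1\), and the parameter values are made up, chosen so that \(d\eta^2 > 1\)):

```python
import random

def leaf_colors(d, q, eta, levels, root, rng):
    """Leaf colors of one broadcast sample on the d-ary tree: each edge
    copies its parent's color w.p. eta, else draws Unif[q]."""
    layer = [root]
    for _ in range(levels):
        layer = [c if rng.random() < eta else rng.randrange(q)
                 for c in layer for _ in range(d)]
    return layer

def count_stat(leaves, q):
    """S = sum of y_v, with y_v = 1 - 1/q if the leaf has color 0, else -1/q."""
    return sum(1 for c in leaves if c == 0) - len(leaves) / q

d, q, eta, levels, trials = 2, 4, 0.8, 8, 3000   # d * eta**2 = 1.28 > 1
rng = random.Random(3)
m_cond = sum(count_stat(leaf_colors(d, q, eta, levels, 0, rng), q)
             for _ in range(trials)) / trials
m_unif = sum(count_stat(leaf_colors(d, q, eta, levels, rng.randrange(q), rng), q)
             for _ in range(trials)) / trials
```

Empirically `m_cond` concentrates near \((d\eta)^{\ell}(1 - 1/q)\) while `m_unif` concentrates near \(0\), matching the first two facts above.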
In conclusion, this estimator performs very similarly to the scenario for the “boring” case of Model 2! But wait:
These questions are addressed by the following two theorems.
In the theorem below, we define count reconstruction estimators to be estimators that depend only on the number of occurrences of each of the \(q\) colors among the symbols at the leaves of the tree.
In particular, \(d\eta^2 = 1\) is the threshold for all sane count reconstruction estimators, like the one we have already described.
In the interest of time, we won’t actually prove both of these theorems. We will sketch some easy versions of a “proof” of the first of these two theorems.
Baby version (\(q = \infty\)): We first handle this easier special case. When \( q = \infty \), the labels in the model can be constructed as follows:
Now consider the estimator defined as follows: If two nodes \(u, v \in L_{\ell}\) have the root as their lowest common ancestor and \(x_u = x_v\), then return \(x_v\), and otherwise return an arbitrary color.
How do we extend this to large, finite \(q\)? First, we make a definition that will be relevant in our extension.
The extension to large, finite \(q\) will result from the following two claims:
We will delay proving these two claims for now. Instead, we proceed to give a good estimator for the case where \(q\) is large and finite, given these claims.
The Estimator: Look for a monochromatic, \(r\)-diluted binary tree of color \(i\). If it exists, return color \(i\).
By the first claim above, if the root is colored with \(i\), then the probability that such a tree exists with color \(i\) is \(\geq \epsilon\). Now suppose that the root is not colored with \(i\). Then we argue that there is no \(r\)-diluted binary tree of color \(i\) with very high probability. Let \(v\) be a child of a node \(w\) in the tree. First we note that
\begin{align*}
\Pr[x_v \neq i | x_w \neq i] &= \eta + (1 - \eta)\left(1 - \frac{1}{q}\right) \\
&= 1 - \frac{1}{q} + \frac{\eta}{q} \underset{q \to \infty}{\longrightarrow} 1
\end{align*}
By taking \(q\) large enough such that \(1 - \frac{1}{q} + \frac{\eta}{q} \geq \eta'\), where \(\eta'\) is from the second claim above, we see that we cannot fit an \(r\)-diluted binary tree of color \(i\) in the remainder of the tree (since there is no space for the necessary 2 leaves at any level!).
Now we’ll prove the second claim. Before we proceed, we note that this proof was given in detail at the beginning of the following lecture, so the proof here reflects the proof done in that lecture.
Define the function \(f\) as follows
\[ f(p) = \Pr[\text{Bin}(d^{r}, p) \geq d^{r} – 1]\]
We can see that \(f(1) = 1\), and furthermore that \(f\) is monotone in \(p\). Furthermore, we can compute \(f(p)\) by simply counting.
\[ f(p) = p^{d^{r}} + d^{r}(1 - p)p^{d^{r} - 1}\]
Now we compute the derivative \(f'(p) = d^{r}p^{d^{r}-1} - d^{r}p^{d^{r}-1} + d^{r}(d^{r}-1)(1-p)p^{d^{r}-2}\) at \(p=1\):
\[ f'(1) = d^{r} - d^{r} + 0 = 0\]
From this, we can conclude that for all \(p^* < 1\) sufficiently close to \(1\), we have that
\[
f(p^*) \geq p^*
\]
We choose a \(p^* > 1 - \epsilon\) and \(\eta' < 1\) such that
\[ q = \Pr[\text{connected to } d^{r} \text{ descendants at distance } r] \geq \frac{p^*}{f(p^*)} \]
Finally, we can proceed to show that \(p_k \geq p^*\) for all \(k\) by induction. We note that \(p_0 = 1\), as the \(0\)-depth tree is just the root. Furthermore
\begin{align*}
p_{r+1} &\geq q \cdot \Pr[\text{Bin}(d^{r}, p_r) \geq d^{r} - 1] \\
&\geq \frac{p^*}{f(p^*)} \cdot f(p_r) \\
&\geq \frac{p^*}{f(p^*)} f(p^*) = p^*
\end{align*}
where the final inequality proceeds from the induction hypothesis that \(p_r \geq p^*\) and the monotonicity of \(f\) in \(p\).
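The claim \(f(p^*) \geq p^*\) near \(p = 1\) and the induction \(p_{r+1} \geq q \cdot f(p_r) \geq p^*\) can be checked numerically (a sketch; \(n\) stands in for \(d^r\), and the concrete values \(n = 4\), \(p^* = 0.99\) are arbitrary):

```python
def f(p, n):
    """f(p) = P[Bin(n, p) >= n - 1] = p**n + n * (1 - p) * p**(n - 1)."""
    return p ** n + n * (1 - p) * p ** (n - 1)

n, p_star = 4, 0.99        # n plays the role of d**r
Q = p_star / f(p_star, n)  # plays the role of the connection probability q

p, traj = 1.0, []          # p_0 = 1: the 0-depth tree is just the root
for _ in range(50):
    p = Q * f(p, n)        # induction step: p_{k+1} = q * f(p_k)
    traj.append(p)
```

The iterates never drop below \(p^*\), exactly as the monotonicity argument above predicts: since \(f\) is monotone, \(Q \cdot f(p_k) \geq Q \cdot f(p^*) = p^*\) once \(p_k \geq p^*\).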
For the rest of the lecture, we will sketch the proof of the theorem that count reconstruction estimators will always fail for \(d \eta^2 < 1\) (the word sketch should be re-emphasized here). Informally, this says that if \(d \eta^2 < 1\), then we cannot reconstruct the character at the root better than random simply by looking at the count statistics on the leaves.
As a concrete example, we can use this CLT for the counts to give a CLT for our earlier count estimator \(S\). Let \(\psi(i) = \mathbf{1}_{\{i = 1\}} - \frac{1}{q}\). Then we can see that \(y_v = \psi(x_v)\), and we can apply the Kesten-Stigum Theorem to \(S_{\ell} = \sum_{v \in L_{\ell}} y_v\).
We now give a high level sketch of why the Kesten-Stigum theorem works for this sum. The idea is to use the right martingale. We consider
exposing one node of the tree at a time, and subtracting the “expected value” of the node given its parent to keep the exposed variables zero mean. More formally, define
\[ S_{\ell}' = \sum_{v: |v| \leq \ell} (d \eta)^{-|v|}(y_v - \eta y_{\text{parent}(v)})\]
where \(|v|\) denotes the level of the vertex \(v\) in the tree (and \(y_{\text{parent}(\text{root})} = 0\)). Note that \(\mathbb{E}\left[y_v | y_{\text{parent}(v)}\right] = \eta \cdot y_{\text{parent}(v)}\), so whenever we add a vertex to the sum, the expected value of the sum remains the same; in other words, \(S_{\ell}'\) is a martingale. By induction, it follows that when we add an entire level to the sum, it remains the same. Hence, we can conclude that
\[ S_{\ell}' = \sum_{v: |v| = \ell} (d \eta)^{-\ell} y_v, \]
which follows by induction and noting that the sum telescopes on adjacent levels. We know that \(S_0' = \sum_{v: |v| = 0} (d \eta)^{0} y_v\), so the claim holds for \(\ell = 0\). We now assume that \(S_{\ell}' = \sum_{v: |v| = \ell} d^{- \ell} \eta^{- \ell} y_v\) and prove the statement for \(\ell + 1\).
\begin{align*}
S_{\ell + 1}' &= \sum_{v: |v| \leq \ell+1} (d \eta)^{-|v|}(y_v - \eta y_{\text{parent}(v)}) \\
&= S_{\ell}' + \sum_{v: |v| = \ell + 1}(d \eta)^{- \ell - 1}y_v - \sum_{v: |v| = \ell + 1} (d \eta)^{-\ell - 1} \eta y_{\text{parent}(v)} \\
&= S_{\ell}' + \sum_{v: |v| = \ell + 1}(d \eta)^{- \ell - 1}y_v - \sum_{v: |v| = \ell} (d \eta)^{-\ell} y_v \\
&= \sum_{v: |v| = \ell + 1}(d \eta)^{- \ell - 1}y_v
\end{align*}
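As a sanity check, the closed form \(S_\ell' = \sum_{v:|v|=\ell}(d\eta)^{-\ell} y_v\) is a purely algebraic identity on any tree in which every vertex has exactly \(d\) children, so it can be verified numerically with arbitrary \(y\)-values (a quick sketch; the function and parameter names are ours):

```python
import random

def check_telescoping(d=3, eta=0.4, depth=4):
    # Verify S'_l (from the definition, summing over all |v| <= l) equals
    # the single-level sum over |v| = l on a complete d-ary tree.
    random.seed(0)
    # levels[l] holds the y-values at level l; vertex i at level l+1
    # has parent i // d at level l.
    levels = [[random.gauss(0, 1)]]
    for l in range(1, depth + 1):
        levels.append([random.gauss(0, 1) for _ in range(d ** l)])

    for l in range(depth + 1):
        s_def = 0.0                       # S'_l from the definition
        for lv in range(l + 1):
            for i, y in enumerate(levels[lv]):
                y_par = 0.0 if lv == 0 else levels[lv - 1][i // d]
                s_def += (d * eta) ** (-lv) * (y - eta * y_par)
        # closed form: weighted sum over level l only
        s_closed = sum((d * eta) ** (-l) * y for y in levels[l])
        assert abs(s_def - s_closed) < 1e-9, (l, s_def, s_closed)
    return True

print(check_telescoping())
```

No broadcast distribution is needed for this check; the martingale structure only enters when analyzing the differences \(S_{\ell}' - S_{\ell-1}'\).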
Now, we would like to appeal to a Martingale Central Limit Theorem to establish that \(S_{\ell}'\) satisfies a CLT that is independent of the root. Note that Azuma's inequality would tell us that \(\lim_{\ell \to \infty} S_{\ell}'\) is highly concentrated around \(S_0'\) if we knew that \(\sum\limits_{k=0}^{\infty} c_k^2 < \infty\), where the \(c_k\) denote the martingale differences; in that case, the leaves would not satisfy a CLT independent of the root. We therefore lower bound the sum of squared differences:
\begin{align*}
\sum_{\ell' \leq \ell}(S_{\ell'}' - S_{\ell' - 1}')^2
&\geq \sum_{\ell’ \leq \ell} d^{\ell’}(d \eta)^{- 2 \ell’} \Theta(1) \\
&= \Theta(1) \sum_{\ell’ \leq \ell} \left(\frac{1}{d\eta^2}\right)^{\ell’}
\end{align*}
If \(d \eta^2 < 1\), this series diverges, which means that \(S_{\ell}'\) satisfies a CLT independent of the root. However, this is not quite enough; a CLT gives a statement of the form \(\mathbb{P}[\sum X_i \in [a - c\sqrt{n}, a + c\sqrt{n}]]\), but just knowing that the counts are concentrated in an interval does not rule out the possibility of an estimator that can reconstruct the root from the parity of the counts (or similar functions that are not constrained by lying in an interval). We note that it does not really make sense for any sane estimator to use the parity when trying to reconstruct the root, but we still have to rule it out!
So we will additionally need a Local Central Limit Theorem, of the form
The local CLT helps us rule out “parity-like” estimators that still only depend on the counts at the leaves.
Transcribed by Jiaoyang Huang (edited by Asad Lodhia, Elchanan Mossel and Matthew Brennan)
For today’s lecture we will present two papers. The first one is Provable Bounds for Learning Some Deep Representations by Arora, Bhaskara, Ge and Ma. They proved that random sparse neural networks are learnable. The second paper is Train faster, generalize better: Stability of stochastic gradient descent by Hardt, Recht and Singer. Their main result states that any model trained with the stochastic gradient method in a reasonable amount of time attains small generalization error.
The sparse random network model has \(\ell\) layers of vectors of binary variables \(h^{(\ell)}, h^{(\ell-1)},\cdots, h^{(1)}\) (where \(h^{(\ell)}\) is the top layer) and an observed layer \(h^{(0)}\) at bottom. Each layer has \(n\) nodes.
\begin{align}
\mathrm{Depth}=\ell, \qquad \mathrm{Width}=n.
\end{align}
The activation function is the sgn function, i.e.
\begin{align}
\sigma(x)=\left\{
\begin{array}{cc}
1 & x>0,\\
0 & x=0,\\
-1 & x<0.
\end{array}
\right.
\end{align}
The network between two layers is a random graph with average degree \(d=n^{\gamma}\) with \(0<\gamma\leq 0.2\). More precisely, each node at level \(k\) is connected to each node at level \(k-1\) with probability \(n^{\gamma-1}\) independently. The edge weights are random in \(\{\pm 1\}\) with probability \(1/2\) each. For any vertex \(i\) in the \(k\)-th layer, and vertex \(j\) in the \((k-1)\)-th layer, the weight \(W_{ji}^{k-1}\in \{-1,0,1\}\) is such that
\begin{align}
W_{ji}^{k-1}
=\left\{
\begin{array}{cc}
0 & \text{there is no edge between \(i,j\).}\\
-1 & \text{there is an edge with weight \(-1\) between \(i,j\).}\\
1 & \text{there is an edge with weight \(1\) between \(i,j\).}
\end{array}
\right.
\end{align}
We remark that \(W^{k-1}\) is the weighted adjacency matrix between \((k-1)\)-th and \(k\)-th layers.
The samples are generated from top to bottom. Given \(h^{(k)}\), the vector in the \(k\)-th layer, the vector in the \((k-1)\)-th layer is \(h^{(k-1)}=\mathrm{sgn}(W^{k-1}h^{(k)})\),
\begin{align}
h^{(k-1)}_j=\mathrm{sgn}\left(\sum_{i=1}^n W^{k-1}_{ji}h^{(k)}_i\right),\quad j=1,2,\cdots, n.
\end{align}
We sample the top layer \(h^{(\ell)}\), where the set of nodes equal to \(1\) is picked uniformly among all sets of size \(\rho_\ell n\). Here \(\rho_\ell\) is called the density of the top layer, and we assume that \(\rho_\ell (d/2)^\ell=O(1)\). Since each node has on average \(d\) children, the density at level \(k\) is roughly of size \(\rho_\ell d^{\ell-k}\ll1\), so \(h^{(k)}\) is also a sparse vector.
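The generative process can be sketched in a few lines (our own illustrative code, not from the paper; all names are ours):

```python
import numpy as np

def sample_network(n, ell, gamma, rng):
    """Ws[k] is the weighted adjacency matrix between layers k+1 and k:
    each edge is present with probability n**(gamma - 1), with a +-1 weight."""
    p = n ** (gamma - 1)
    Ws = []
    for _ in range(ell):
        mask = rng.random((n, n)) < p
        signs = rng.choice([-1, 1], size=(n, n))
        Ws.append(mask * signs)
    return Ws

def generate_sample(Ws, rho_top, rng):
    """Generate (h^{(ell)}, ..., h^{(0)}) top-down via h^{(k-1)} = sgn(W h^{(k)})."""
    n = Ws[0].shape[0]
    h = np.zeros(n, dtype=int)
    support = rng.choice(n, size=max(1, int(rho_top * n)), replace=False)
    h[support] = 1                       # sparse top layer
    layers = [h]
    for W in Ws:
        h = np.sign(W @ h).astype(int)   # sgn activation, values in {-1, 0, 1}
        layers.append(h)
    return layers
```

On samples generated this way, the lower layers indeed have density roughly \(\rho_\ell d^{\ell-k}\), matching the heuristic above.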
The main theorem we will prove is the following:
Thanks to the sparsity, the net approximately preserves information, which allows one to easily go back and forth between representations in two adjacent layers. In the following we first focus on learning a single-layer network. It has a top layer with vector \(h^{(1)}\) and a bottom layer with vector \(h^{(0)}\). We assume that \(h^{(1)}\) is sampled among sparse vectors, and then \(h^{(0)}\) is generated using \(h^{(1)}\). To learn a single-layer network, we first need to recover the network structure. Then for each sample, we need to recover \(h^{(1)}\) from \(h^{(0)}\).
Given a single-layer network (with known graph structure), the property that we can recover \(h^{(1)}\) from \(h^{(0)}\) is characterized by the following definition:
We will show that a single-layer random network is a denoising autoencoder if the input layer and noise are both sparse.
Let \(S\) be the support of the top layer \(h^{(1)}\), and \(S’\) be the support of the bottom layer \(h^{(0)}\). The key property for the sparse network is that most neighbors of an activated node are unique. More precisely, fix \(i\) in the top layer, and \(j\in {\cal N}(i)\) in the bottom layer, where \({\cal N}(i)\) is the set of neighbor vertices of \(i\), i.e. \(W_{ji}\neq 0\). Thanks to the sparsity assumption, with high probability, \(j\) has no non-zero neighbor other than \(i\), i.e.
\begin{align}
{\mathbb P}(|{\cal N}(j)\cap S\setminus \{i\}|\geq 1)\leq \rho d<0.1.
\end{align}
If \({\cal N}(j)\cap S\setminus \{i\}=\emptyset\), and \(h_j^{(0)}\) was not flipped by the noise, then \(h_{i}^{(1)}=W_{ji}h_j^{(0)}\). Moreover,
\begin{align}
{\mathbb P}\big({\cal N}(j)\cap S\setminus \{i\}\neq\emptyset \text{ or \(h_j^{(0)}\) was flipped by the noise}\big)< 0.2.
\end{align}
Motivated by these observations, we can recover \(h^{(1)}\) using the following quantity,
\begin{align}
I_{i}=\sum_{j\in {\cal N}(i)}W_{ji}h_j^{(0)}.
\end{align}
If \(|I_i|\geq d/2\), then we set \(h_i^{(1)}=\mathrm{sgn}(I_i)\). Otherwise, if \(|I_i| < d/2 \), we set \(h_i^{(1)}=0\).
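In code, the decoding rule reads as follows (a sketch of the rule just described; the matrix layout and the tiny hand-built example are ours):

```python
import numpy as np

def decode_layer(W, h0, d):
    """Recover h^{(1)} from h^{(0)}: W[j, i] is the weight of the edge between
    top node i and bottom node j; I_i = sum_j W_ji h0_j, thresholded at d/2."""
    I = W.T @ h0
    return np.where(np.abs(I) >= d / 2, np.sign(I), 0).astype(int)

# Tiny example: top node 0 has four unique (non-shared) neighbors,
# top node 1 is inactive, and there is no noise.
W = np.array([[1, 0], [-1, 0], [1, 0], [1, 0]])
h1 = np.array([1, 0])
h0 = np.sign(W @ h1).astype(int)
print(decode_layer(W, h0, d=4))  # recovers [1 0]
```

When most neighbors of an activated node are unique, the sums \(I_i\) concentrate near \(\pm d\) for active nodes and near \(0\) for inactive ones, which is exactly why the \(d/2\) threshold works.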
In the following we focus on learning the structure of a single-layer network. We recall that \(h^{(1)}\) is sampled among sparse vectors (with density \(\rho\)), and then \(h^{(0)}\) is generated using \(h^{(1)}\). The graph structure can be recovered using \(\mathrm{poly}(n)\) samples.
The high level network inference algorithm in the paper learns the network layer by layer starting from the bottom. For the network between the \(i\)-th and \((i+1)\)-th layers, the algorithm first recovers the network structure between levels \(i\) and \(i+1\). Then it recovers \(h^{(i+1)}\) from \(h^{(i)}\) for each sample as above.
There is a technical issue. In the original model, we sample \(h^{(\ell)}\) uniformly among all the sparse vectors, then sequentially generate \(h^{(\ell-1)}, h^{(\ell-2)}, \cdots, h^{(0)}\). The bits of \(h^{(i+1)}\) are not independent! We cannot directly use the Lemma above to recover the network structure between levels \(i\) and \(i+1\). However, this problem can be resolved by showing that the bits of \(h^{(i+1)}\) cannot be too correlated unless they have a common ancestor. Moreover, the correlation decays exponentially, i.e.
\begin{align}
\text{grandparent correlation} \leq \frac{1}{d} \times \text{parent correlation}.
\end{align}
Let’s consider the following general setting of supervised learning. There is an unknown distribution \(\cal D\). We receive a sample \(S=(z_1,z_2,\cdots, z_n)\) of \(n\) i.i.d. examples drawn from \(\cal D\). Let \(f(w, z)\) denote a loss function, where \(w\) specifies the model and \(z\) denotes an example. Our goal is to minimize the population risk:
\begin{align}
R[w]:= {\mathbb E}_{z\sim \cal D}[f(w,z)],
\end{align}
over all possible models \(w\). Since the distribution \(\cal D\) is unknown, we cannot measure the quantity \(R[w]\) directly; instead, we minimize the empirical risk:
\begin{align}
R_S[w]:=\frac{1}{n}\sum_{i=1}^n f(w,z_i).
\end{align}
The generalization error of a model \(w\) is the difference
\begin{align}
R_S[w]-R[w].
\end{align}
If \(w=A(S)\) is chosen as a function of the data by a possibly random algorithm \(A\), we need to estimate the expected generalization error
\begin{align}
\varepsilon_{gen}:=\mathbb E_{S,A}[R_S[A(S)]-R[A(S)]].
\end{align}
In order to bound the generalization error, we recall the following definition of uniform stability,
In the following we show that for stochastic gradient iterative algorithms, we can control their uniform stability, and obtain some upper bound on the generalization error.
Given the loss function \(f(w,z)\), the stochastic gradient update with learning rate \(\alpha_t>0\) is given by
\begin{align}
w_{t+1}=w_t-\alpha_t \nabla_w f(w_t, z_{i_t})
\end{align}
where the indices \(i_t\) are either picked uniformly randomly in \(\{1,2,\cdots,n\}\), or picked according to a random permutation.
In the rest of the lecture, we analyze stochastic gradient descent for convex minimization. We first discuss a few facts about convexity.
For the proof of \eqref{e:sconvex1}, we introduce a new function
\begin{align}
g(z)=f(z)-\nabla f(x)^T z.
\end{align}
Then \(g(z)\) is convex and \(\nabla g\) is \(\beta\)-Lipschitz. Moreover, thanks to the convexity of \(f\), i.e.
\begin{align}
f(z)\geq f(x)+\nabla f(x)^T(z-x),
\end{align}
\(g\) achieves its minimum at \(z=x\).
Since \(\nabla g\) is \(\beta\)-Lipschitz, by the Taylor expansion, for any \(z\),
\begin{align}
g(z)\leq g(y)+\nabla g(y)^T(z-y)+\frac{\beta}{2}\|z-y\|_2^2.
\end{align}
By taking the minimum on both sides,
\begin{equation}
\begin{split}
g(x)
&=\min_z\{ g(z)\}\\
&\leq \min_z\left\{g(y)+\nabla g(y)^T(z-y)+\frac{\beta}{2}\|z-y\|_2^2\right\}\\
&=g(y)-\frac{1}{2\beta}\|\nabla g(y)\|_2^2,
\end{split}
\end{equation}
which is \eqref{e:sconvex1}.
We prove the stability bound for convex loss minimization via stochastic gradient method.
We couple the two algorithms with data sets \(S\) and \(S’\). At step \(t\), there are two cases: with probability \(1-1/n\), the algorithms choose the same data example; with probability \(1/n\), the algorithms choose different data examples.
If the two algorithms choose the same data example \(z\) at step \(t\), then
\begin{equation}
\begin{split}
\delta_{t+1}^2
=&\|w_t-w_t’ -\alpha_t(\nabla f(w_t, z)-\nabla f(w’_t,z))\|^2_2\\
=&\|w_t-w_t’\|^2 +\alpha_t^2\|\nabla f(w_t, z)-\nabla f(w’_t,z))\|^2_2\\
&-2\alpha_t\langle w_t-w_t’, \nabla f(w_t, z)-\nabla f(w’_t,z)\rangle\\
\leq& \|w_t-w_t’\|^2=\delta_t^2,
\end{split}
\end{equation}
provided that \(\alpha_t\leq 2/\beta\), where we used \eqref{e:coercive}.
If the two algorithms choose different data examples \(z\) and \(z’\), then
\begin{equation}
\begin{split}
\delta_{t+1}&=\|w_t-w_t’ -\alpha_t(\nabla f(w_t, z)-\nabla f(w’_t,z))\|_2\\
&\leq \|w_t-w_t’\|_2+ \alpha_t\left(\|\nabla f(w_t, z)\|_2+\|\nabla f(w’_t,z))\|_2\right)\\
&\leq \delta_t+2\alpha_tL.
\end{split}
\end{equation}
The first case happens with probability \(1-1/n\) and the second case with probability \(1/n\). On average, we have
\begin{equation}
\begin{split}
\mathbb E[\delta_{t+1}]&\leq \left(1-\frac{1}{n}\right)\mathbb E[\delta_t]+
\frac{1}{n}\mathbb E[\delta_t+2\alpha_t L]\\
&=\mathbb E [\delta_t]+\frac{2\alpha_tL}{n}.
\end{split}
\end{equation}
Overall, we have
\begin{align}
\mathbb E[\delta_{T}]\leq\frac{2L}{n}\sum_{t=1}^T\alpha_t,
\end{align}
and
\begin{align}
|f(w_T,z)-f(w_T',z)|\leq L \|w_T-w_T'\|_2\leq \frac{2L^2}{n}\sum_{t=1}^T\alpha_t.
\end{align}
This finishes the proof.
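The coupling argument can be checked empirically. Below is a small simulation (our own, with illustrative parameters) using the logistic loss \(f(w,(x,y)) = \log(1+e^{-y\langle w,x\rangle})\) with \(\|x\|_2 \le 1\), which is convex, \(1\)-Lipschitz and \(1/4\)-smooth, so \(L = 1\) and the step sizes easily satisfy \(\alpha_t \le 2/\beta\):

```python
import numpy as np

def sgd(w, data, idx, alphas):
    # Stochastic gradient updates w_{t+1} = w_t - alpha_t * grad f(w_t, z_{i_t}).
    for t, i in enumerate(idx):
        x, y = data[i]
        grad = -y * x / (1.0 + np.exp(y * np.dot(w, x)))  # logistic-loss gradient
        w = w - alphas[t] * grad
    return w

rng = np.random.default_rng(0)
n, dim, T, L = 20, 5, 200, 1.0
alphas = np.full(T, 0.1)

xs = rng.normal(size=(n, dim))
xs /= np.maximum(1.0, np.linalg.norm(xs, axis=1, keepdims=True))  # ||x|| <= 1
ys = rng.choice([-1.0, 1.0], size=n)
S = list(zip(xs, ys))
S_prime = list(S)
S_prime[0] = (-xs[0], ys[0])          # the two data sets differ in one example

deltas = []
for _ in range(50):
    idx = rng.integers(0, n, size=T)  # coupled runs share the index sequence
    w = sgd(np.zeros(dim), S, idx, alphas)
    w_prime = sgd(np.zeros(dim), S_prime, idx, alphas)
    deltas.append(np.linalg.norm(w - w_prime))

bound = 2 * L / n * alphas.sum()      # the stability bound (2L/n) * sum_t alpha_t
print(np.mean(deltas), "<=", bound)
```

The divergence accumulates only when the differing example is sampled, roughly \(T/n\) times, so the empirical average sits well inside the bound.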
Transcribed by Vishesh Jain (edited by Asad Lodhia and Elchanan Mossel)
We begin by discussing (some variants of) the PAC (Probably Approximately Correct) learning model introduced by Leslie Valiant. Throughout this section, we will deal with a hypothesis class or concept class, denoted by \(\mathcal{C}\); this is a space of functions \(\mathcal{X}\rightarrow\mathcal{Y}\), where \(\mathcal{X}=\{0,1\}^{n}\) for some \(n\in\mathbb{N}\) and \(\mathcal{Y}\) is some finite set.
As a concrete example, we could consider formalizing the task of classifying whether a \(16\times16\) bitmap depicts a cat, a dog, or neither, in which case we could take \(n\) to be \(256\), \(\mathcal{Y}\) to be \(\{c,d,n\}\) (where \(c\) denotes “Cat”, \(d\) denotes “Dog” and \(n\) denotes “Neither”), and the concept class \(\mathcal{C}\) to consist of a set of “classifiers” i.e. a subset of functions \(\{0,1\}^{256}\rightarrow\{c,d,n\}\).
Before proceeding further, we make a few remarks on the assumptions underlying the PAC learning model. For many applications, the assumption that our samples are i.i.d. samples from a common distribution \(\mathcal{D}\) can be very unrealistic. For instance, if our sample consists of images generated by taking snapshots of a video at a fixed frequency, then we expect to have some correlation between samples. On the other hand, the requirement that a PAC learner should perform well under any distribution on the domain space can be quite restrictive, and indeed, it is the case that proofs of theorems about non-PAC-learnability sometimes involve carefully constructing a “bad” distribution for a given algorithm. We also emphasize that the definition of PAC learnability that we have presented here seeks efficiency in terms of both time complexity as well as sample complexity. This is in contrast to statistical learning theory, where the focus is typically only on sample complexity.
Another notion, central to the definition of PAC learnability as presented above, is that of “realizability” i.e. the assumption that the data is generated by a function in our hypothesis class \(\mathcal{C}\). This can obviously fail to hold in many cases of interest, for instance, when we choose a not-too-big hypothesis class \(\mathcal{C}\) to avoid over-fitting. To deal with such situations we introduce the stronger notion of agnostic PAC learning.
The general theory of PAC learning goes through with arbitrary domains \(\mathcal{X}\) and arbitrary codomains \(\mathcal{Y}\). For example, one could talk about the PAC learnability of a concept class \(\mathcal{C}\) of functions \(\mathbb{R}^{n}\rightarrow\mathbb{R}\). One of the things that changes when going from the discrete to the non-discrete setting is that the “loss” \(P_{(x,y)\sim\mathcal{D}}[h(x)\neq y]\) for \(h\in\mathcal{C}\) becomes too stringent, and often not useful. Instead, it is much more common to replace it with \(\mathbb{E}_{(x,y)\sim\mathcal{D}}[L(h(x),y)]\), where \(L\colon\mathcal{Y}\times\mathcal{Y}\rightarrow\mathbb{R}\) is a choice of “loss function”. For instance, when \(\mathcal{Y}=\mathbb{R}^{k}\), common choices of \(L\) include the
\(L^{1}\) loss, \(L(y_{1},y_{2})=||y_{1}-y_{2}||_{L^{1}}\) and the \(L^{2}\) loss, \(L(y_{1},y_{2})=||y_{1}-y_{2}||_{L^{2}}^{2}\).
We conclude this section with an example-sketch of how basic results in PAC learning are proved. Let \(\mathcal{X}=\mathbb{S}^{n-1}\subset\mathbb{R}^{n}\) be the
unit sphere in \(\mathbb{R}^{n}\) and \(\mathcal{Y}=[0,1]\). Consider the concept class \(\mathcal{C}\) consisting of functions \(\mathbb{S}^{n-1}\rightarrow[0,1]\) of the form
\[
h_{w}(x)=\sum_{i=1}^{n}w_{i}x_{i}\qquad\text{with}\qquad w\in\mathbb{S}^{n-1}\subset\mathbb{R}^{n}.
\]
Finally, the loss function \(L\) is the \(L^{2}\) loss (or in this case, the quadratic loss) given by \(L(y_{1},y_{2})=(y_{1}-y_{2})^{2}\). To show that \(\mathcal{C}\) is agnostically PAC learnable, it suffices to exhibit an agnostic PAC learning algorithm \(A\) for \(\mathcal{C}\). We claim that the empirical risk minimizer, given by
\[
\tilde{h}=h_{w^{*}}=\sum_{i=1}^{n}w_{i}^{*}x_{i}
\]
for \(w^{*}=\text{argmin}_{w\in\mathbb{S}^{n-1}}\sum_{i=1}^{m}L(h_{w}(x_{i}),y_{i})\) (where \(m\) is the number of samples) is such a learner. There are two steps involved in showing this: first, we note that \(w^{*}=\arg\min_{w}\sum_{i=1}^{m}(h_{w}(x_{i})-y_{i})^{2}\) has a well-known formula which can be computed in \(\text{poly}(m)\) time; second, assuming that \(\{(x_{i},y_{i})\}_{i=1}^{m}\) are i.i.d. samples from some distribution \(\mathcal{D}\) on \(\mathbb{S}^{n-1}\times[0,1]\), it follows from a standard concentration of measure argument that for \(m=\text{poly}(n,\frac{1}{\epsilon},\frac{1}{\delta})\) samples, the empirical loss \(\frac{1}{m}\sum_{i=1}^{m}(\tilde{h}(x_{i})-y_{i})^{2}\) will be \(\frac{\epsilon}{4}\) close to the true loss \(\mathbb{E}_{(x,y)\sim\mathcal{D}}[L(\tilde{h}(x),y)]\) with probability at least \(1-\delta\) over the samples, and one can use this, along with the compactness of \(\mathbb{S}^{n-1}\times[0,1]\) to show that \(\mathbb{E}_{(x,y)\sim\mathcal{D}}[L(\tilde{h}(x),y)]\leq\inf_{h\in\mathcal{C}}\mathbb{E}_{(x,y)\sim\mathcal{D}}[L(h(x),y)]+\epsilon\).
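The first step — computing the least-squares ERM in closed form — can be sketched as follows (illustrative parameters; for simplicity we omit the projection of \(w^{*}\) back onto \(\mathbb{S}^{n-1}\), which the formal argument would handle):

```python
import numpy as np

rng = np.random.default_rng(1)
n_dim, m = 5, 2000
w_true = rng.normal(size=n_dim)
w_true /= np.linalg.norm(w_true)               # target h_w with w on the sphere

X = rng.normal(size=(m, n_dim))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # samples x_i on S^{n-1}
y = X @ w_true + 0.05 * rng.normal(size=m)     # agnostic setting: noisy labels

# Empirical risk minimizer for the quadratic loss: ordinary least squares,
# computable in poly(m) time via the normal equations.
w_star, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.linalg.norm(w_star - w_true))
```

With \(m\) polynomial in the dimension, the ERM recovers a \(w\) whose population loss is close to the best in the class, as the concentration argument above guarantees.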
In this section, we intend to give a flavour of PAC learning results for feedforward neural networks. This is a very young and active area of research, and most theoretical results are negative. However, in this section, our focus will be on positive results, following Livni, Shalev-Shwartz and Shamir. Using the same notation as in the previous section, we work in the setting where \(\mathcal{X}=\{0,1\}^{d}\) and \(\mathcal{Y}=\{\pm1\}\). Our concept classes will be feedforward neural networks depending on a few parameters, and we will denote them by \(\mathcal{C}_{t,n,\sigma,L}\): \(t\) denotes the depth of the network, \(n\) denotes its size, \(\sigma\) denotes the activation function, and \(L\) denotes a constant such that for each neuron in the network, we have \(||w||_{1}+|b|\leq L\), where \(w\) denotes the vector of weights input to the neuron, and \(b\) denotes the bias added by the neuron. Note that the function computed by such a network is real valued—we convert it to a \(\{\pm1\}\) valued function by taking its sign. Moreover, in the learning theorems below, we will make both the “realisability” assumption as well as the “margin” assumption i.e. we assume that our learner receives samples of the form \((x,\text{sign}(h^{*}(x)))\), where \(h^{*}\) is some (real-valued) function belonging to the class \(\mathcal{C}\), and further, \(|h^{*}(x)|\geq1\) for all \(x\in\mathcal{X}\). Finally, we will consider improper learners.
Indeed, a polynomial network of depth \(\ell\) always computes a polynomial function in \(d\) variables of degree at most \(2^{\ell}\). We can view such polynomials as living in a linear space of dimension \(k\), where \(k\leq(d+1)^{2^{\ell}}\) denotes the number of monomials of degree at most \(2^{\ell}\) in \(d\) variables. Since \(\ell\) is constant, \(2^{\ell}\) is also constant. Hence, we are reduced to learning a linear function on \(\text{poly}(d)\) variables, which can be done in \(\text{poly}(d)\) time (for instance, when a convex loss function is used) using standard learning algorithms.
By an argument involving the approximation of the sigmoid function by Chebyshev polynomials, the previous proposition can be used to show the following:
We conclude this brief section by mentioning a recent learning result due to Goel, Kanade, Klivans and Thaler for feedforward neural networks with ReLU gates. In this theorem, we take \(\mathcal{X}=\mathbb{S}^{d-1}\) and \(\mathcal{Y}=[0,1]\). We mention that the theorem actually holds for a stronger notion called reliable learnability. We also note that the \(2^{O(1/\epsilon)}\) factor appearing below precludes this from being a statement about (efficient) PAC-learnability, according to the definition in this lecture.
We now discuss results related to the hardness of learning. In general, hardness results state that it is algorithmically hard to find a solution to a problem, even though it is information theoretically possible to do so. Hardness of learning results are often based on hardness assumptions coming from cryptography or computational complexity. For instance, a typical hardness result might state that if one can learn a class of functions, then one can break a cryptosystem, or one can solve some canonical algorithmic problem (which is believed to be hard). An early result in this direction is the following proposition due to Kearns and Valiant:
In the above result, we are considering public-key cryptography with key \(K=(K_{priv},K_{pub})\), randomized encryption (using the key \(K\), and random bits \(r\)), and deterministic decryption (using the key \(K\)). Weak PAC-learnability relaxes the requirement that \(P_{x\sim\mathcal{D}}[\tilde{h}(x)\neq h(x)]\le \epsilon\) to \(P_{x\sim\mathcal{D}}[\tilde{h}(x)\neq h(x)]<\frac{1}{2}-\frac{1}{n^{c}}\) for some constant \(c\), i.e. the learner performs non-negligibly better than random guessing. Finally, an algorithm \(A\) is said to distinguish between the encryptions of \(0\) and \(1\) if for some universal constant \(c\) \[ |P_{K,r}[A(K_{pub},e_{K,r}(1))=1]-P_{K,r}[A(K_{pub},e_{K,r}(0))=1]|\geq\frac{1}{n^{c}}. \] The intuition behind the Kearns-Valiant theorem is that, since encryptions are easy to generate, one can view a collection of encryptions as training examples for learning the decryption function. But since learning the decryption function to a non-negligible advantage amounts to breaking the cryptosystem, it follows that if the cryptosystem is secure, we cannot learn the decryption function. Our focus in this section will be a result of Klivans and Sherstov which states that PAC learning the concept class \(\mathcal{C}\) of \(2\)-layer feedforward neural networks with the threshold activation function \(\sigma(x)=\boldsymbol{1}_{x\geq0}\) is hard under a certain cryptographic assumption. More precisely, we have the following:
Here, \(f(n)-uSVP\) refers to the unique shortest vector problem on an \(n\)-dimensional lattice, in which the goal is to find the shortest non-zero vector in a given lattice, provided that it is shorter by a factor of at least \(f(n)\) than any other non-parallel vector. We take \(f(n)=\tilde{O}(n^{1.5})\), an approximation factor for which this problem is believed to be hard. The theorem follows from the following result about the hardness of learning the intersection of half-spaces.
Indeed, this result implies the previous one, since half-spaces i.e. subsets of the form \(a_{1}x_{1}+\dots+a_{n}x_{n}-\theta\geq0\) are easily implemented by threshold gates, and so are intersections. Moreover, by an argument due to Regev, the theorem continues to hold even when all the half-spaces involved are “light” i.e. \(\sum_{i=1}^{n}|a_{i}|+|\theta|\) is bounded by some polynomial in \(n\).
To prove the latter theorem, one first uses a simple linearization argument (like the one we used in the previous section) to show that if the intersections of \(n^{\lambda}\) halfspaces in \(n\) dimensions are weakly PAC-learnable, then for any constant \(c>0\), the intersections of \(n^{c}\) degree-2 polynomial threshold functions (PTFs), i.e. functions of the form \(\boldsymbol{1}_{p(x)\geq0}\) where \(p(x)\) is a polynomial of degree at most \(2\), are also weakly PAC-learnable. Next, we show that the class of such intersections of degree-2 PTFs contains the decryption function of a cryptosystem whose security is based on the hardness of \(\tilde{O}(n^{1.5})-uSVP\). By the Kearns-Valiant result, it follows that if this class is weakly PAC-learnable, the cryptosystem is insecure, and hence \(\tilde{O}(n^{1.5})-uSVP\) is solvable in polynomial time.
We now provide a description of this cryptosystem due to Regev. We begin by taking \(N=2^{8n^{2}}\) and \(m=cn^{2}\) where \(c\) is a universal constant. The private key consists of a real number \(H\in[\sqrt{N},2\sqrt{N}]\), and the public key consists of a vector \((A_{1},\dots,A_{m},i_{0})\) where \(i_{0}\in\{1,\dots,m\}\), and each \(A_{i}\in\{0,\dots,N-1\}\). Moreover, the \(A_{i}’s\) are “almost” integer multiples of \(N/H\) and \(A_{i_{0}}\) is “almost” an odd integer multiple of \(N/H\). To encrypt \(0\), we pick a random subset \(S\subset[m]\) and output \(\sum_{i\in S}A_{i}\mod N\) ; to encrypt 1, we pick a random subset \(S\subset[m]\) and output \(\lfloor\frac{A_{i_{0}}}{2}\rfloor+\sum_{i\in S}A_{i}\mod N\). The idea is that \(0\) will be encrypted to something which is close to an integer multiple of \(N/H\), whereas \(1\) will be encrypted to something which is close to an odd half-integer multiple of \(N/H\). Accordingly, on receiving \(W\in\{0,\dots,N-1\}\), the decryption system outputs \(0\) if \(\{WH/N\}<\frac{1}{4}\) and outputs \(1\) otherwise. Here, \(\{a\}\) for \(a\in\mathbb{R}\) denotes the minimum distance between \(a\) and an integer, where the distance between two real numbers is the absolute value of their difference. Moreover, by a standard argument, we may use the decryption system \(\{AW\}\), where \(A\) is a representation of \(H/N\) to within \(\text{poly}(n)\) fractional bits.
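A toy instantiation of this scheme (grossly insecure parameters, chosen only to illustrate the encryption/decryption mechanics described above; all names and constants in the code are ours):

```python
import random

random.seed(0)
N = 2 ** 32                      # toy size; the real scheme uses N = 2**(8 n^2)
H = random.randint(int(N ** 0.5), 2 * int(N ** 0.5))   # private key
m = 20
step = N / H

# Public key: A_i "almost" integer multiples of N/H, with A_{i0} "almost"
# an odd multiple of N/H.
A = [round(step * random.randrange(2, H // 2, 2)) for _ in range(m)]
i0 = 0
A[i0] = round(step * random.randrange(3, H // 2, 2))

def enc(bit):
    S = [i for i in range(m) if random.random() < 0.5]  # random subset of [m]
    W = sum(A[i] for i in S)
    if bit == 1:
        W += A[i0] // 2          # shifts W*H/N by roughly a half-integer
    return W % N

def frac(a):
    return abs(a - round(a))     # distance to the nearest integer

def dec(W):
    return 0 if frac(W * H / N) < 0.25 else 1

print(all(dec(enc(b)) == b for b in (0, 1) for _ in range(50)))
```

Encryptions of \(0\) land near integer multiples of \(N/H\) and encryptions of \(1\) near half-integer multiples, which is exactly what the \(\{WH/N\}<\frac14\) test distinguishes.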
The only thing that remains to be shown is that the decryption system for this cryptosystem can be implemented using the intersection of \(n^{\beta}\) degree-2 PTFs for some constant \(\beta>0\). We have already remarked that we can use a decryption function of the form \(\{AW\}\) where \(A\) has \(k=\text{poly}(n)\) fractional bits. Moreover, since \(W\) is integral, we can ignore the integral part of \(A\). Denoting \(A=.b_{1}b_{2}\dots b_{k}\) and \(W=\sum_{j=0}^{\ell-1}x_{j}2^{j}\) with \(\ell\leq8n^{2}\), we have
\begin{eqnarray*}
\{AW\} & = & \{\sum_{i=1}^{k}\sum_{j=0}^{\ell-1}b_{i}x_{j}2^{j-i}\}\\
& = & \{\sum_{i=1}^{k}\sum_{j=0}^{\min\{\ell-1,i-1\}}b_{i}x_{j}2^{j-i}\}
\end{eqnarray*}
since we can discard those terms \(b_{i}x_{j}2^{j-i}\) which are integers. Denoting \(S(x)=\sum_{i=1}^{k}\sum_{j=0}^{\min\{\ell-1,i-1\}}b_{i}x_{j}2^{j-i}\), we observe that \(S(x)\) is an integral multiple of \(2^{-k}\) which lies between \(0\) and \(k\). Let \(\{[a_{u},b_{u}]\}\) denote the collections of intervals in \([0,k]\) which are not within \(1/4\) of an integer. Note that there are \(k\) such intervals, and that each interval \([a,b]\) can be recognized by the PTF \(\left(S(x)-\frac{a+b}{2}\right)^{2}\leq\left(\frac{b-a}{2}\right)^{2}\). Hence, the negation of such an interval can be recognized by \(\left(S(x)-\frac{a+b}{2}\right)^{2}>\left(\frac{b-a}{2}\right)^{2}\), and by taking the intersection of these \(k\) degree-2 PTFs, we obtain the region of \([0,k]\) which is within \(1/4\) of an integer, which is precisely the region where the decryption system outputs \(0\).
To conclude, we mention some other hardness of learning results in the literature. Assuming that there is no efficient algorithm that never refutes a satisfiable random \(3\)-SAT formula with \(f(n)=n\log^{2}n\) clauses, but refutes most such random formulae, Daniely and Shalev-Shwartz proved that learning the intersection of \(\log^{3}n\) half-spaces is hard. Daniely also proved that agnostically learning a single half-space is hard, assuming that a related computational problem is hard on average. Along the lines of our comments about PAC learning, we note that these results all show that there are some distributions and some functions which are hard to learn. It would be interesting to obtain hardness results for an a priori fixed, natural choice of distribution.
Transcribed by Paxton Turner (edited by Asad Lodhia, Elchanan Mossel and Matthew Brennan)
In the Boolean circuits setup, there is a separation result based on depth of the following form:
Question: Can we prove a similar result for deep nets?
Last time we discussed the following separation result.
Today we will discuss a more elegant treatment of the 2 vs. 3 separation result which yields slightly stronger results.
Daniely’s model is formed as follows. Let \( x, x' \) be chosen uniformly (i.e., with respect to the Haar measure) from \(S^{d-1}\) and let \(f(x, x') = g(\langle x, x' \rangle) \). The nets of depth \(\ell\) are of the form:
\begin{eqnarray*}
\ell = 2 \qquad &w_2^{\text{T}} \sigma( W_1 x + W_2 x’ + b_1) + b_2 \\
\ell = 3 \qquad &w_3^{\text{T}} \sigma (W_2 \sigma(W_1 x + W_2 x’ + b_1) + b_2) + b_3 \\
\vdots \qquad &\vdots
\end{eqnarray*}
on the input \((x, x’)\). Here, we prove the following theorem.
This illustrates the “magic of 3.” Intuitively, we see the power of quadratic models over linear ones, at least for this specific context.
We begin with the following lemma which is of the flavor of universality of depth 1 nets:
Proof of Lemma (Sketch): Repartition the interval, and note that the Lipschitz condition implies the graph doesn’t deviate too far from a straight line.
Proof of the Theorem: We will prove the theorem in the following section and then show a corresponding lower bound.
We first show that the function can be approximated well by a depth 3 net:
\[
\langle x, x' \rangle = 2\left\| \frac{x+x'}{2} \right\|_2^2 - 1 = 2\sum_{i=1}^d \left( \frac{x_i + x_i'}{2} \right)^2 - 1,
\]
which implies that a 2-layer network approximates \( \langle x, x’ \rangle \) well. We can also compute \( \sigma( \langle x, x’ \rangle) \) as a linear combination of the previous layers. This in combination with the lemma implies the desired result.
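The identity being used is \(\langle x, x'\rangle = 2\left\|\frac{x+x'}{2}\right\|_2^2 - 1\) for unit vectors; here is a quick numerical sanity check (ours):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=8)
x /= np.linalg.norm(x)           # a uniform direction on S^7
xp = rng.normal(size=8)
xp /= np.linalg.norm(xp)

lhs = x @ xp
# Right-hand side: computable from squares of coordinates of (x + x')/2,
# which is what a small net can represent.
rhs = 2 * np.sum(((x + xp) / 2) ** 2) - 1
assert abs(lhs - rhs) < 1e-12
```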
We want to show that a depth two net that approximates the function well must be very wide. For the analysis it will be useful to answer the following question:
Question: What is the distribution of \( \langle x, x’ \rangle \)?
Approximation: Rotation invariance implies that
\[
D(\langle x, x’ \rangle) = D'( \langle x, r \rangle ) = D( \langle x, e_1 \rangle) = D(x_1)
\]
Note that this is also true if \( x’ \) is deterministic and \( x \) is random.
The individual components \(\langle x, e_k \rangle\) of a uniform random vector \( x \in S^{d-1}\) approach a Gaussian as \(d \to \infty \). Indeed,
\begin{eqnarray*}
d \mu_d(x) &= \text{Vol}\left( (1 - x^2)^{1/2} S^{d-2} \right) \\
&= \frac{\Gamma(d/2)}{\sqrt{\pi} \Gamma((d-1)/2)} \left(1 - x^2 \right)^{\frac{d-3}{2}}
\end{eqnarray*}
which is a t-statistic distribution (and as \(d \to \infty\) converges to a Gaussian). The same reasoning also applies if one of the vectors \(x, x'\) is fixed. Therefore, if \(g = \psi ( \langle v, x \rangle, \langle v', x' \rangle ) \), then
\[
||g||_2^2 = ||\psi||_{L^2(\mu_d \times \mu_d)}^2,
\]
and if \(f = \phi( \langle x, x’ \rangle ) \), then
\[
||f||_2^2 = ||\phi||_{L^2(\mu_d)}^2 = \int_{S^{d-1} \times S^{d-1}} f^2(x, x’) \, d(m \times m),
\]
where \(d(m \times m)\) indicates the Haar measure.
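The Gaussian limit for a single coordinate can be seen empirically (a quick simulation of our own):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 500
# Uniform points on S^{d-1}: normalized Gaussian vectors.
x = rng.normal(size=(10000, d))
x /= np.linalg.norm(x, axis=1, keepdims=True)

# Since Var(<x, e_1>) = 1/d exactly, sqrt(d) * <x, e_1> should be
# approximately standard normal for large d.
z = np.sqrt(d) * x[:, 0]
print(z.mean(), z.std())
```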
Orthogonal Polynomials: Define \( P_0 = 1, P_1 = x,\) and
\[P_n(x) = \frac{2n + d - 4}{n + d - 3} \, x \, P_{n-1}(x) - \frac{n - 1}{n + d - 3} \, P_{n-2}(x).\]
We observe that
\[
||P_n||_2^2 = 1/N_{n,d} \quad \text{where} \quad N_{n,d} = \frac{(2n + d - 2) (n + d - 3)!}{n! (d - 2)!}
\]
and that \( ||P_n||_{\infty} = 1\). Now define
\[
h_n(x, x') = \frac{P_n(\langle x, x' \rangle)}{\sqrt{N_{n,d}}} \quad \text{and} \quad L_n^{v}(x) = h_n(v,x).
\]
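As an aside, the orthogonality and normalization of the \(P_n\) can be checked numerically with Gauss-Legendre quadrature (our own sanity check; the quadrature is exact up to floating point since every integrand here is a low-degree polynomial):

```python
import numpy as np
from math import gamma, factorial, sqrt, pi

d = 6  # illustrative dimension

def P(n, x):
    # Recurrence for the dimension-d Legendre polynomials, with P_n(1) = 1.
    if n == 0:
        return np.ones_like(x)
    if n == 1:
        return x
    return ((2 * n + d - 4) * x * P(n - 1, x)
            - (n - 1) * P(n - 2, x)) / (n + d - 3)

def N(n):
    # N_{n,d} = (2n + d - 2)(n + d - 3)! / (n! (d - 2)!)
    return (2 * n + d - 2) * factorial(n + d - 3) / (factorial(n) * factorial(d - 2))

x, w = np.polynomial.legendre.leggauss(200)   # nodes/weights on [-1, 1]
dens = gamma(d / 2) / (sqrt(pi) * gamma((d - 1) / 2)) * (1 - x ** 2) ** ((d - 3) / 2)

def inner(f, g):
    return float(np.sum(w * dens * f * g))    # integral against mu_d

assert abs(inner(P(2, x), P(3, x))) < 1e-8             # orthogonality
assert abs(inner(P(3, x), P(3, x)) - 1 / N(3)) < 1e-8  # ||P_n||^2 = 1/N_{n,d}
```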
The key fact we use is that
\[
\langle L_i^v, L_j^{v'} \rangle = \mathbb{E}[ L_i^v(x) L_j^{v'}(x)] = \mathbb{E}[h_i(v, x) h_j(v', x) ] = \delta_{i,j} P_i(\langle v, v' \rangle),
\]
which implies
\[
\mathbb{E}[h_n(x, x') L_i^v(x) L_j^{v'}(x')] = \mathbf{1}(i = j = n)\, N_{n,d}^{-1/2} P_n(\langle v, v' \rangle).
\]
To flesh out the above more carefully,
\begin{align*}
\mathbb{E}[h_n(x,x') L_i^v(x) L_j^{v'}(x')] &= \mathbb{E}_x\left[L_i^v(x) \mathbb{E}_{x'}[h_n(x,x') L_j^{v'}(x')]\right] \\
&= \mathbb{E}_x\left[L_i^v(x)\, \delta_{j = n} \frac{L_n^{v'}(x)}{\sqrt{N_{n,d}}}\right] \\
&= \delta_{j = n}\, \delta_{i = n}\, \frac{P_n(\langle v, v' \rangle)}{\sqrt{N_{n,d}}}
\end{align*}
Lower Bound Idea: We now outline how to obtain the desired lower bound. Expand \(g(x, x') = \sum_i \alpha_i h_i(x,x') \) and choose \(f\) that is “noise sensitive”, i.e., has a lot of mass at high Fourier levels. For the rest of the argument assume that \( f = h_n \) for a large \(n\) (\( h_n \) is not a Lipschitz function, but one can find a Lipschitz function that has a lot of weight on high levels).
We want to show that \( ||h_n - \sum_{j=1}^m g_j ||_2^2 \geq c \) unless \(m\) or some \(||g_j||_2\) is large. Indeed, if \(h_n\) is close to \(\sum_{j = 1}^m g_j \), then \( \langle \sum_j g_j, h_n \rangle \geq \frac{||h_n||_2^2}{4} \), so by pigeonhole there must be at least one \(j\) with \( \langle g_j, h_n \rangle \geq \frac{||h_n||_2^2}{4m}. \)
We now compute \( \langle g_j , h_n \rangle \). Writing
\[
g_j(x,x') = \psi(\langle v, x \rangle, \langle v', x' \rangle),
\]
we can expand
\[
g_j = \sum \beta_{k, \ell}\, L_k^v(x) L_{\ell}^{v'}(x') \quad \text{with} \quad \sum \beta_{k, \ell}^2 = ||g_j||_2^2.
\]
Therefore,
\[
|\langle g_j, h_n \rangle | = N_{n,d}^{-1/2} |\beta_{n,n} P_n( \langle v, v' \rangle )| \leq N_{n,d}^{-1/2} |\beta_{n,n}|,
\]
so \(m\) is very large or some \(||g_j||_2^2\) is very large. Informally, we get a bad approximation of \(h_n(x, x')\) unless \(m\) or \(\max_j ||g_j||_2\) is very large (of order \(N_{n,d}^{1/2}/m\), which grows rapidly with \(n\) and \(d\)).
Threshold gates are given by
\[
\mathbf{1}\left\{\sum w_i x_i \geq t\right\}.
\]
The basic unit threshold is
\[
f(x) = \text{sgn}(w_0 + \sum_{i = 1}^n w_i x_i).
\]
We assume the input distribution is uniform over \(\{0,1\}^n\).
Meta-Question: Can we use reductions between models (e.g., the previous result) to get better separations in the binary case?
Partial answer: Due to complexity-theoretic issues, better than \( \text{poly}(n) \) isn’t possible.
Question: Can we substitute new activations into the previous result (or this one)?
Answer: This isn’t written down, but it should be straightforward.
Claim 1: \(M\) can be computed in depth 2 using \(O(n \log n)\) gates.
Claim 2: Parity can be computed using threshold gates.
Proof of Claim 2: Use a first layer of threshold gates computing
\[
b_k = \mathbf{1}\left\{\sum_i x_i \geq k\right\}, \qquad k = 1, \ldots, n.
\]
It can be shown that there exist weights \( w_1, \ldots, w_n\) such that
\[
\text{sgn}(\sum w_i b_i) = \oplus x_i.
\]
In fact, this can be done with weights \(1 \leq w_i \leq \text{poly}(n)\), though we don’t show this here.
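One concrete choice that realizes Claim 2 is alternating weights \(w_k = (-1)^{k+1}\) with threshold \(1/2\) (our illustrative choice, not necessarily the one intended in the lecture). A quick exhaustive check:

```python
from itertools import product

def parity_threshold_net(x):
    # depth-2 threshold circuit for parity:
    # first layer: b_k = 1{sum_i x_i >= k} for k = 1..n
    # second layer: one gate with alternating weights (-1)^{k+1}, threshold 1/2
    n = len(x)
    s = sum(x)
    b = [1 if s >= k else 0 for k in range(1, n + 1)]
    z = sum((-1) ** (k + 1) * b_k for k, b_k in enumerate(b, start=1))
    # z equals (sum x_i) mod 2, since the alternating sum of s ones is s mod 2
    return 1 if z - 0.5 > 0 else 0

# verify against xor on every input for small n
ok = all(
    parity_threshold_net(x) == (sum(x) % 2)
    for n in range(1, 8)
    for x in product([0, 1], repeat=n)
)
```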
Lower bound ingredients: We now move onto the lower bound.
Lemma: For all but a fraction \( \epsilon \) of \(n\)-bit Boolean functions \(f\), there is no depth-2 network of size \( o(\epsilon^2 2^n/n) \) that agrees with \(f\) on more than a \(\frac{1}{2} + \epsilon\) fraction of the inputs.
Proof sketch of lemma: The number of threshold functions is less than \(2^{O(n^2)}\), while the total number of Boolean functions is \(2^{2^n}\). That is, there aren’t too many threshold functions in general. A counting argument then shows that for most Boolean functions \(f\) on \(\{0,1\}^n\), no 2-layer network of size \(O(\frac{\epsilon^2 2^n}{n^2})\) approximates \(f\) well.
Returning to the proof, the idea is to generate a random function, allowing us to use the lemma’s statement that small networks do not approximate most functions well. The construction is:
Question: What does a 2-layer network look like after randomly fixing almost all bits?
Intuitive answer: Once enough bits are fixed, most gates become constant and die. Then we’re left with a smaller network which can’t approximate \(f\) well.
Main Lemma: A random restriction of a threshold gate will result in a constant function with probability at least
\[
1 – O(\log n / n^{1/2}).
\]
Thus, the number of gates that remain is \( O(n/\log^2(n)) \), which is not enough to represent a random function.
We close this lecture with the definition of PAC-learning. We have a space \(X_n\) and are interested in functions \( f: X_n \to \{0,1\} \). Let \( \mathcal{C} \) be the class of such functions (e.g., 2-layer networks).
Here, \(\delta\) is interpreted as a global parameter, and \(\epsilon\) is the probabilistic error.
Transcribed by Julien Edward Clancy (edited by Asad Lodhia, Elchanan Mossel and Matthew Brennan)
Last lecture we saw the most basic theorem about neural network approximation: that for most activation functions, including any \( \sigma \colon \mathbb{R} \to \mathbb{R} \) such that \( \sigma(x) \to 0 \) as \( x \to -\infty \) and \( \sigma(x) \to 1 \) as \( x \to \infty \), and also including the ReLU function, we can represent almost any function given enough neurons, with small error in any reasonable norm. One of the proofs relied on Fourier analysis. The simpler one was to observe that by scaling and translating to \( \sigma(\lambda x + b) \) for large \( \lambda \) and some \( b \) we can approximate the function
\[
f(x) = \chi_{[a, \infty)}(x)
\]
for every \( a \). By subtracting one of these from another we can approximate the indicator of any interval, and by standard arguments we can: 1) approximate any continuous function in the uniform sense, and 2) approximate any \( L^2 \) function in the \( L^2 \) sense. The last observation can also be adapted to Sobolev spaces, or even more detailed function classes.
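This construction is easy to see numerically. The sketch below (our own illustration, not from the lecture) measures the \(L^2\) error of the two-sigmoid approximation to the indicator \(\chi_{[0.2, 0.7]}\) as the scale \(\lambda\) grows:

```python
import numpy as np

def sigmoid(z):
    # clip to avoid overflow warnings in exp for very negative arguments
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500.0, 500.0)))

def soft_indicator(x, a, b, lam):
    # sigma(lam(x - a)) - sigma(lam(x - b)) -> indicator of [a, b] as lam -> inf
    return sigmoid(lam * (x - a)) - sigmoid(lam * (x - b))

x = np.linspace(-1.0, 2.0, 300_001)
dx = x[1] - x[0]
target = ((x >= 0.2) & (x <= 0.7)).astype(float)

# L^2 error shrinks as the scale lambda increases
errs = [
    float(np.sqrt(np.sum((soft_indicator(x, 0.2, 0.7, lam) - target) ** 2) * dx))
    for lam in (10.0, 100.0, 1000.0)
]
```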
An argument can be made that the result above has little to do with the practice of neural networks. Deep networks have gained currency because of their expressive power relative to their size and number of parameters — though the networks used in practice have millions of nodes, they dramatically outpace shallow networks of the same size. The first order of business of this lecture is to take a step towards explaining this.
For the result we’re about to state (from a paper of Telgarsky that can be found here) let’s consider ReLU networks. Let \( m(x) \) be the piecewise linear function that is \( 2x \) if \( x \in [0, 0.5] \), \( 2-2x \) if \( x \in [0.5, 1] \), and \( 0 \) otherwise. This looks like:
This can be computed exactly by a two- or three-layer network, depending on your notational preferences, by the expression
\[
\sigma( 2 \sigma(x) - 4 \sigma(x - 0.5) )
\]
which is a diamond-shaped network:
or in terms of the activation function:
where the function \( m \) is slightly offset for clarity. Since we’re looking at deep networks, what do iterates of this function look like? Here are the first four:
Each iteration replaces the triangle with two triangles that are half as wide (but of the same height), so \( m^{(n)}(x) \) has \( 2^n \) “teeth”. Moreover, by chaining these networks, we can obviously represent this function with \( 3n+1 \) nodes in a deep network. However, it should take very many nodes to do this in a shallow network. Intuitively, depth increases the number of oscillations multiplicatively (which is why we have an exponential number of teeth) whereas width can only do so additively. The precise (and fairly strong) version of this intuition states:
From this result it is clear that a depth-\(\ell\), width-\(w\) ReLU network must produce an output that is at most \((2w)^\ell\)-sawtooth. If \(w < 2^{(n-k)/\ell-1}\) then \((2w)^\ell < 2^{n-k}\), so the network computes a function \(g\) that is at most \(2^{n-k}\)-sawtooth, whereas \(m^{(n)}\) is precisely \(2^n\)-sawtooth. Now, let's examine each interval on which \(g\) is affine. If the interval contains \(p\) data points, then since \(g\) is affine on this interval it can have the correct sign for at most \(\frac{p+3}{2}\) of them. There are \(2^n\) data points overall, and if we partition them into the at most \(2^{n-k}\) intervals where \(g\) is affine, then \(g\) can have the correct sign for at most \(2^{n-1} + 3 \cdot 2^{n-k-1}\) of them. This means that \(R(g) \geq \frac{1}{2} - \frac{3}{2^{k+1}}\).
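The piece counting in this argument is easy to verify numerically. The sketch below (our own illustration) iterates the three-ReLU triangle network and counts the affine pieces of \(m^{(n)}\) on \([0,1]\), which should be exactly \(2^n\):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def m(x):
    # the triangle map as a tiny ReLU net: sigma(2 sigma(x) - 4 sigma(x - 0.5))
    return relu(2.0 * relu(x) - 4.0 * relu(x - 0.5))

x = np.linspace(0.0, 1.0, 2**12 + 1)   # dyadic grid, so breakpoints are sampled exactly
dx = x[1] - x[0]

def count_pieces(y):
    # number of maximal intervals on which the sampled function is affine
    slopes = np.round(np.diff(y) / dx, 6)
    return 1 + int(np.count_nonzero(slopes[1:] != slopes[:-1]))

pieces = []
y = x.copy()
for n in range(1, 5):
    y = m(y)                            # y now samples m^{(n)}
    pieces.append(count_pieces(y))      # expect 2^n pieces, i.e. 2^{n-1} teeth
```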
It is an instructive exercise to the reader to reproduce the example for the three-class, three-output extension (where the assigned class is taken to be the maximum of the three output values).
More generally, we are interested in approximating a function \(f \colon [0,1]^d \to \mathbb{R}\) by a ReLU network \(\tilde{f}\) of depth \(\ell\) and width \(w\), where the output level is linear and we use the squared loss to evaluate network performance. The general theme is to find structural properties of \(f\) that make it hard to approximate by shallow networks — the theorem above was the first such result we’ll see, for a certain notion of oscillatory behavior for functions in one dimension. Since deep nets are mostly used for high-dimensional problems, it is interesting to try to classify how well networks can approximate such functions.
Recalling that shallow nets cannot have too many saw-teeth, we first obtain a more general result for quadratic functions:
One nice interpretation of the above result is that the coefficient of the \(x^2\) term is the curvature of the function, so we’ve bounded the approximation capabilities of a neural network in terms of the curvature of the function being approximated.
That result covers quadratic functions. Using the theory of orthogonal polynomials we can extend the result to the case of arbitrary degree polynomials, or analytic functions, or really any \( L^2 \) function. The Legendre polynomials are one such family, starting out as
\[
1, \quad x, \quad \frac{1}{2} \left( 3x^2 - 1 \right), \quad \ldots
\]
This sequence can be generated in many different ways, but one simple way is to do Gram-Schmidt on the set \(\{1, x, x^2, x^3, \ldots\}\) (with the \(L^2([-1,1])\) inner product, with which we will be working for the foreseeable future).
Back to the topic at hand. Since the Legendre polynomials \(h_i\) are an orthonormal family, and \(\{h_0, \ldots, h_n\}\) spans exactly the space of polynomials of degree at most \(n\), the minimum from before becomes the length of an orthogonal projection onto the space of polynomials orthogonal to linear polynomials, and kills exactly the terms in an expansion corresponding to \(h_1\) and \(h_0\), the constant and linear terms. More precisely, if \(f = \sum a_i h_i\), then
\[
\min_{c, d} \int_{-1}^1 \left| f - cx - d \right|^2 = \sum_{i \geq 2} |a_i|^2 \geq |a_2|^2
\]
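Numerically, with the example \(f(x) = x^2\), the best linear approximation error and \(|a_2|^2\) agree (both equal \(8/45\) in this case). A short check, using the orthonormal Legendre polynomials on \([-1,1]\):

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 400_001)
dx = x[1] - x[0]

def ip(u, v):
    # Riemann-sum approximation of the L^2([-1,1]) inner product
    return float(np.sum(u * v) * dx)

# orthonormal Legendre polynomials h_0, h_1, h_2
h = [np.full_like(x, np.sqrt(0.5)),
     np.sqrt(1.5) * x,
     np.sqrt(2.5) * 0.5 * (3.0 * x**2 - 1.0)]

f = x**2
a = [ip(f, hi) for hi in h]   # Legendre coefficients of f

# error of the best linear fit = sum of squared coefficients at levels >= 2
r = f - a[0] * h[0] - a[1] * h[1]
best_linear_err = ip(r, r)
```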
This gives the same curvature-type result as above, in an \(L^2\) sense. However we can relate this to more canonical properties of a function as follows. We say that a function \(f\) is \(\lambda\)-strongly convex (concave) in \(I\) if \(f''(x) \geq \lambda\) (\(f''(x) \leq -\lambda\)) throughout \(I\). This condition easily implies that \(f'(t) \geq f'(0) + \lambda t\) for \(t \geq 0\), and \(f'(t) \leq f'(0) + \lambda t\) for \(t \leq 0\). For any \(f\) we have
\[
a_2 = \langle f, h_2 \rangle \sim \int_{-1}^1 f(t) (3t^2 - 1) \, dt = - \int_{-1}^1 (t^3 - t) f'(t) \, dt
\]
and if \(f\) is \(\lambda\)-strongly convex this is
\[
\int_0^1 (t - t^3) f'(t) \, dt + \int_{-1}^0 (t - t^3) f'(t) \, dt.
\]
The first integral is at least \(\int_0^1 (t - t^3) (f'(0) + \lambda t) \, dt\), since \(t - t^3 \geq 0\) and \(f'(t) \geq f'(0) + \lambda t\) on \([0,1]\). The other integral is treated identically, and after some elementary manipulation the \(f'(0)\) contributions cancel, leaving \(a_2 \geq C \lambda\). Therefore the \(L^2\) bound gives
\[
\lVert f – g \rVert^2 \geq C \frac{\lambda^2}{n^4}
\]
What about higher dimensions? Let \(f \in \mathcal{C}^2([0,1]^d)\). If there is a unit vector \(v\) and a connected set \(U\) such that \(\langle v, H(f)(x) v \rangle \geq \lambda\) on \(U\), where by \(H_x(f)\) we mean the Hessian matrix, then projecting along \(v\) gives strong convexity, and integrating along these one-dimensional slices gives
\[
\lVert f – g \rVert \geq C \frac{\lambda^2}{n^4} \text{Vol}(U)
\]
There’s something a little unsatisfying about this condition, however, which is that in truly high dimensions we expect functions to be intrinsically more complex than in lower dimensions, and their approximation theory should scale accordingly. Neural networks excel at high-dimensional classification in practice, and this success is surely not limited to the type of one-dimensional phenomenon described above. However, let us keep in mind that while we think of the space of, say, \(1024 \times 1024\) images as a million-dimensional vector space, it really is just a space of functions operating on two dimensions: the dimensionality of a sampling of a scene, which is exactly what a camera does, is a different notion than the dimensionality of the space on which the true scene, as a function, operates. This should be contrasted with (say) the case of DNA, where the “true” space actually does have nearly a hundred thousand dimensions.
The remainder of the lecture is an overview of a few results, with the ideas of proofs sketched or absent.
One potential model for approximability of functions (which is not exactly a neural network) is a logical circuit, with function complexity measured in circuit complexity. The idea is that we have a bit string input (in \(\{0,1\}^n\)) as the “input layer”, and at each layer we can pass the bits from the previous layer through the unary NOT or binary AND or OR gates; we call such circuits Boolean. The very strong result we want to compare to is:
This result follows a long line of work in theoretical computer science, in particular work by Hastad. We note, in comparison, that the separation results obtained by Telgarsky do not provide a depth separation between depth \( d \) and \( d – 1 \) like the result above.
As a step in this direction we will discuss a result of Eldan and Shamir showing separation between depths 2 and 3 for pretty general activation functions. More precisely, suppose \(\sigma\) satisfies
\[
|\sigma(x)| \leq C(1 + |x|^\alpha)
\]
for some \(C\) and \(\alpha\), and assume it is sufficiently expressive in the sense that for all Lipschitz \(f \colon \mathbb{R} \to \mathbb{R}\), constant outside \([-R, R]\), and \(\delta > 0\), there exist \(a, \alpha_i, \beta_i, \gamma_i\) such that
\[
\lVert f – a – \sum_i \alpha_i \sigma(\beta_i x – \gamma_i) \rVert_\infty < \delta
\]
Informally, there is a universal lower bound on the approximability of \(3\)-layer networks by \(2\)-layer networks.
Transcribed by Joshua Pfeffer (edited by Asad Lodhia, Elchanan Mossel and Matthew Brennan)
For most of today’s lecture, we present a non-rigorous review of deep learning; our treatment follows the recent book Deep Learning by Goodfellow, Bengio and Courville.
We begin with the model we study the most, the “quintessential deep learning model”: the deep forward network (Chapter 6 of GBC).
When doing statistics, we begin with a “nature”, or function \(f\); the data is given by \(\left\langle X_i, f(X_i) \right\rangle\), where \(X_i\) is typically high-dimensional and \(f(X_i)\) is in \(\{0,1\}\) or \(\mathbb{R}\). The goal is to find a function \(f^*\) that is close to \(f\) using the given data, hopefully so that you can make accurate predictions.
In deep learning, which is by-and-large a subset of parametric statistics, we have a family of functions
\[
f(X;\theta)
\]
where \(X\) is the input and \(\theta\) the parameter (which is typically high-dimensional). The goal is to find a \(\theta^*\) such that \(f(X;\theta^*)\) is close to \(f\).
In our context, \(\theta\) is the network . The network is a composition of \(d\) functions
\[
f^{(d)}( \cdot, \theta) \circ \cdots \circ f^{(1)}( \cdot, \theta),
\]
most of which will be high-dimensional. We can represent the network with the diagram
where \(h_1^{(i)},\ldots,h_n^{(i)}\) are the components of the vector-valued function \(f^{(i)}\)—also called the \(i\)-th layer of the network—and each \(h_j^{(i)}\) is a function of \((h_1^{(i-1)},\ldots,h_n^{(i-1)})\). In the diagram above, the number of components of each \(f^{(i)}\)—which we call the width of layer \(i\)—is the same, but in general, the width may vary from layer to layer. We call \(d\) the depth of the network. Importantly, the \(d\)-th layer is different from the preceding layers; in particular, its width is \(1\), i.e., \(f = f^{(d)}\) is scalar-valued.
Now, what statisticians like the most are linear functions. But, if we were to stipulate that the functions \(f^{(i)}\) in our network should be linear, then the composed function \(f\) would be linear as well, eliminating the need for multiple layers altogether. Thus, we want the functions \(f^{(i)}\) to be non-linear.
A common design is motivated by a model from neuroscience, in which a cell receives multiple signals and its synapse either doesn’t fire or fires with a certain intensity depending on the input signals. With the inputs represented by \(x_1,\ldots,x_n \in \mathbb{R}_{+}\), the output in the model is given by
\[
f(x) = g\left( \sum a_i x_i + c \right)
\]
for some non-linear function \(g\). Motivated by this example, we define
\[
h^{(i)} = g^{\otimes}\left( {W^{(i)}}^T x + b^{(i)} \right),
\]
where \(g^{\otimes}\) denotes the coordinate-wise application of some non-linear function \(g\).
Which \(g\) to choose? Generally, people want \(g\) to be the “least non-linear” function possible—hence the common use of the ReLU (Rectified Linear Unit) function \(g(z) = \max(0,z)\). Other choices of \(g\) (motivated by neuroscience and statistics) include the logistic function \[ g(z) = \frac{1}{1 + e^{-2\beta z}} \] and the hyperbolic tangent \[ g(z) = \tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}.\] These functions have the advantage of being bounded (unlike the ReLU function).
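The layer maps \(h^{(i)} = g^{\otimes}({W^{(i)}}^T h^{(i-1)} + b^{(i)})\) can be sketched in a few lines. This is our own minimal illustration of a forward pass, with ReLU inner layers and a linear top layer (names and shapes are ours):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(x, layers):
    # layers: list of (W, b) pairs; each inner layer applies the
    # coordinate-wise nonlinearity g to W^T h + b, and the top layer is linear
    h = x
    for W, b in layers[:-1]:
        h = relu(W.T @ h + b)
    W, b = layers[-1]
    return W.T @ h + b

# a tiny hand-checkable example: relu([1, -2]) = [1, 0], then 1*1 + 1*0 + 0.5
layers = [(np.eye(2), np.zeros(2)),
          (np.array([[1.0], [1.0]]), np.array([0.5]))]
out = forward(np.array([1.0, -2.0]), layers)
```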
As noted earlier, the top layer has a different form from the preceding ones. First, it is usually scalar-valued. Second, it usually has some statistical interpretation: \(h_1^{(d-1)},\ldots,h_n^{(d-1)}\) are often viewed as parameters of a classical statistical model, informing our choice of \(g\) in the top layer. One example of such a choice is the linear function \(y = W^T h + b\), motivated by thinking of the output as the conditional mean of a Gaussian. Another example is \(\sigma(w^T h + b)\), where \(\sigma\) is the sigmoid function \(x \mapsto \frac{1}{1+e^{-x}}\); this choice is motivated by thinking of the output as the probability of a Bernoulli trial with \(P(y)\) proportional to \(\exp(yz)\), where \(z = w^T h + b\). More generally, the soft-max is given by
\[
\text{softmax}(z)_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}
\]
where \(z = W^T h +b\). Here, the components of \(z\) may correspond to the possible values of the output, with \(\text{softmax}(z)_i\) the probability of value \(i\). (We may consider, for example, a network with input a photograph, and interpret the output \((\text{softmax}(z)_1,\text{softmax}(z)_2,\text{softmax}(z)_3)\) as the probabilities that the photograph depicts a cat, dog, or frog.)
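A minimal soft-max implementation (the max-subtraction is a standard numerical-stability trick; it cancels in the ratio and does not change the result):

```python
import numpy as np

def softmax(z):
    # subtracting max(z) avoids overflow in exp; the output is unchanged
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

# larger components of z get larger probabilities, and the output sums to 1
p = softmax(np.array([2.0, 1.0, 0.0]))
```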
In the next few weeks, we will address the questions: 1) How well can such functions approximate general functions? 2) What is the expressive power of depth and width?
Convolution networks , described in Chapter 9 of GBC, are networks with linear operators , namely, localized convolution operators using some underlying grid geometry. For example, consider the network whose \(k\)-th layer can be represented by the \(m \times m\) grid
We then define the function \(h^{(k+1)}_{i,j}\) in layer \(k+1\) by convolving over a \(2 \times 2\) square in the layer below, and then applying the non-linear function \(g\):
\[
h_{i,j}^{(k+1)} = g\left( a^{(k)} h_{i,j}^{(k)} + b^{(k)} h_{i+1,j}^{(k)} + c^{(k)} h_{i,j+1}^{(k)} + d^{(k)} h_{i+1,j+1}^{(k)} \right)
\]
The parameters \(a^{(k)}, b^{(k)}, c^{(k)},\) and \(d^{(k)}\) depend only on the layer, not on the particular square \(i,j\). (This restriction is not necessary from the general definition, but is a reasonable restriction in applications such as vision.) In addition to the advantage of parameter sharing, this type of network has the useful feature of sparsity resulting from the local nature of the definition of the functions \(h\).
A common additional ingredient in convolutional networks is pooling, in which, after convolving and applying \(g\) to obtain the grid-indexed functions \(h_{i,j} ^{(k+1)}\), we replace these function with the average or maximum of the functions in a neighborhood; for example, setting
\[
\overline{h}_{i,j}^{(k+1)} = \frac{1}{4} \left( h_{i,j}^{(k+1)} + h_{i+1,j}^{(k+1)} + h_{i,j+1}^{(k+1)} + h_{i+1,j+1}^{(k+1)} \right).
\]
This technique can also be used to reduce dimension.
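A sketch of one convolutional layer and one average-pooling step, following the formulas above with overlapping \(2 \times 2\) neighborhoods and shared parameters (the stride and boundary choices here are ours):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def conv_layer(h, a, b, c, d):
    # h is an (m, m) grid; the same parameters (a, b, c, d) are shared by
    # every 2x2 square (parameter sharing), giving an (m-1, m-1) output
    return relu(a * h[:-1, :-1] + b * h[1:, :-1]
                + c * h[:-1, 1:] + d * h[1:, 1:])

def avg_pool(h):
    # average pooling over the same overlapping 2x2 neighborhoods
    return 0.25 * (h[:-1, :-1] + h[1:, :-1] + h[:-1, 1:] + h[1:, 1:])

# on a constant grid with weights summing to 1, both maps preserve the value
h = np.ones((4, 4))
out = avg_pool(conv_layer(h, 0.25, 0.25, 0.25, 0.25))
```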
The next question we address is, how do we find the parameters for our networks, i.e. , which \(\theta\) do we take? Also, what criteria should we use to evaluate which \(\theta\) are better than others? For this, we usually use statistical modeling. The net \(\theta\) determines a probability distribution \(P(\theta)\); often, we will want to maximize \(P_{\theta}(y|x)\). Equivalently, we will want to minimize \[ J(\theta) = – \mathbb{E} \log P_{\theta}(y|x),\] where the expectation is taken over the data (log likelihood). For example, if we model \(y\) as a Gaussian with mean \(f(x;\theta)\) and covariance matrix the identity, then we want to minimize the mean error cost \[ J(\theta) = \mathbb{E}\left[ \left\| y – f(x;\theta) \right\|^2 \right] \] A second example to consider: \(y\) sampled according to the Bernoulli distribution with probability exponential in \(w^T h + b\), where \(h\) is the last layer; in other words, \(P(y)\) is logistic with parameter \(z = w^T h + b\).
How do you optimize \(J\) both accurately and efficiently? We will not discuss this issue in detail in this course since so far there is not much theory on this question (though such knowledge could land you a lucrative job in the tech sector). What makes this optimization hard is that 1) dimensions are high, 2) the data set is large, 3) \(J\) is non-convex, and 4) there are too many parameters (overfitting). Faced with this task, a natural approach is to do what Newton would do: gradient descent . A slightly more efficient way to do gradient descent (for us, yielding an improvement by a factor of the size of the network) is back-propagation , a method that involves clever bookkeeping of partial derivatives by dynamic programming.
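To illustrate that bookkeeping, here is one gradient step for a tiny two-layer ReLU network with squared loss, with the chain rule written out by hand (all sizes, values, and the learning rate are illustrative):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def loss_and_grads(W1, b1, w2, b2, x, y):
    # forward pass, caching intermediates for reuse in the backward pass
    z1 = W1 @ x + b1
    h1 = relu(z1)
    yhat = w2 @ h1 + b2
    loss = (yhat - y) ** 2
    # backward pass: chain rule through the cached values (backpropagation)
    dyhat = 2.0 * (yhat - y)
    dz1 = dyhat * w2 * (z1 > 0)          # ReLU gate: derivative is 1{z1 > 0}
    grads = {"w2": dyhat * h1, "b2": dyhat,
             "W1": np.outer(dz1, x), "b1": dz1}
    return loss, grads

# one gradient-descent step decreases the loss on a fixed example
x, y = np.array([1.0, 0.0]), 1.0
W1, b1 = np.array([[0.5, 0.0], [0.0, 0.5]]), np.zeros(2)
w2, b2 = np.array([0.5, 0.5]), 0.0
loss0, g = loss_and_grads(W1, b1, w2, b2, x, y)
lr = 0.1
loss1, _ = loss_and_grads(W1 - lr * g["W1"], b1 - lr * g["b1"],
                          w2 - lr * g["w2"], b2 - lr * g["b2"], x, y)
```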
Another technique that we will not discuss (but would help you find remunerative employment in Silicon Valley) is regularization. Regularization addresses the problem of overfitting, an issue summed up well by the quote, due to John Von Neumann, that “with four parameters I can fit an elephant and with five I can make him wiggle his trunk.” The characterization of five parameters as too many might be laughable today, but the problem of overfitting is as real today as ever! Convolution networks provide one solution to the problem of overfitting through parameter sharing. Regularization offers another solution: instead of optimizing \(J(\theta)\), we optimize \[ \tilde{J}(\theta) = J(\theta) + \Omega(\theta),\] where \(\Omega\) is a “measure of complexity”. Essentially, \(\Omega\) introduces a penalty for “complicated” or “large” parameters. Some examples of \(\Omega\) include the \(L_2\) or \(L_1\) norms (preferred to \(L_0\) for convexity reasons). In the context of deep learning, there are other ways of addressing the problem of overfitting. One is data augmentation, in which the data is used to generate more data; for example, from a given photo, more photos can be generated by rotating the photo or adding shade (a rotated dog is still a dog!). Another is noising, in which noise is added either to the data (for example, by taking a photo and blacking it out) or to the parameters.
There are a number of probabilistic models that are used in deep learning. The first one we describe is an example of a graphical model . Graphical models are families of distributions that are parametrized by graphs, possibly with parameters on the edges. Since deep nets are graphs with parameters on the edges, it is natural to see whether we can express it as a graphical model. A Deep Boltzmann machine is a graphical model whose joint distribution is given by the exponential expression
\[
P(v,h^{(1)},\ldots,h^{(d)}) = \frac{1}{Z(\theta)} \exp\left( E(v,h^{(1)},\ldots,h^{(d)}) \right),
\]
where the energy \(E\) of a configuration is given by
\[
E(v,h^{(1)},\ldots,h^{(d)}) = \sum {h^{(i)}}^T W^{(i+1)} h^{(i+1)} + v^T W^{(1)} h^{(1)}.
\]
Typically, internal layers are real-valued vectors; the top and bottom layers are either discrete or real-valued.
What does this look like as a graph—what is the graphical model here? It is a particular type of bipartite graph, in which the vertices corresponding to each layer are connected only to the layer immediately above it and the layer immediately below it.
The Markov property says that, for instance, conditional on \(h_1\), the distribution of a component of \(v\) is independent of both \(h_2,\ldots,h_d\) and the other components of \(v\). If \(v\) is discrete,
\[
P[v_i=1| h^{(1)} ] = \sigma( W_{i,*}^{(1)} h^{(1)}),
\]
and similarly for other conditioning.
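The Markov property makes Gibbs-style sampling of a single layer straightforward. A sketch for binary units, following the energy convention \(P \propto \exp(E)\) from the notes (shapes and variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_middle_layer(W_k, h_below, W_k1, h_above):
    # conditional on the layers directly below and above, the components of
    # h^{(k)} are independent, with
    #   P[h^{(k)}_i = 1 | rest] = sigma(((W^{(k)})^T h^{(k-1)} + W^{(k+1)} h^{(k+1)})_i)
    p = sigmoid(W_k.T @ h_below + W_k1 @ h_above)
    return (rng.random(p.shape) < p).astype(float), p

# with zero weights every unit is an independent fair coin
h, p = sample_middle_layer(np.zeros((3, 4)), np.ones(3),
                           np.zeros((4, 2)), np.ones(2))
```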
Unfortunately, we don’t know how to sample from or optimize in graphical models in general, which limits their usefulness in the deep learning context.
Deep belief networks are computationally simpler, though a bit more annoying to define. These “hybrid” networks are essentially a directed graphical model with \(d\) layers, except the top two layers are not directed: \(P(h^{(d-1)},h^{(d)})\) is given by
\begin{equation}
P(h^{(d-1)},h^{(d)}) = Z^{-1} \exp\left( {b^{(d)}}^T h^{(d)} + {b^{(d-1)}}^T h^{(d-1)} + {h^{(d-1)}}^T W^{(d)} h^{(d)} \right)
\label{star}
\end{equation}
For the other layers,
\begin{equation}
P( h^{(k)}_i = 1 \mid h^{(k+1)}) = \sigma^{\otimes}\left(b^{(k)} + {W^{(k+1)}}^T h^{(k+1)}\right)_i.
\label{starr}
\end{equation}
Note that we are going in the opposite direction from before. However, we have the following fact: if \(h^{(k)}, h^{(k+1)} \in \{0,1\}^n\) are distributed according to (\ref{star}), then they satisfy (\ref{starr}).
Note that we know how to sample the bottom layers conditional on layers directly from above; but for inference we also need the conditional distribution of the output given the input.
Finally, we emphasize that, while the \(k\)-th layer in the deep Boltzmann machine depends on layers \(k+1\) and \(k-1\), in deep belief, if we condition only on layer \(k+1\), we can accurately generate the \(k\)-th layer (not conditional on other layers).
In this class, the main topics we plan to discuss are
The first topic addresses the descriptive power of networks: what functions can be approximated by networks? The papers we plan to discuss are
The first two papers, which we will start to describe later in this lecture, prove that “you can express everything with a single layer” (If you were planning to drop this course, a good time to do so would be after we cover those two papers!). The subsequent papers, however, show that this single layer would have to be very wide, in a sense we will make precise later.
Regarding the second topic, hardness results we discuss in this course will likely include
On the algorithmic side:
Finally, papers on generative models that we will read will include
Again, today we will begin to study the first two papers on the first topic—the papers by Cybenko and Hornik.
In his 1989 paper, Cybenko proved the following result.
In the above theorem, \(C_n([0,1]) = C([0,1]^n)\) is the space of continuous functions from \([0,1]^n\) to \(\mathbb{R}\) with the metric \(d(f,g) = \sup|f(x)-g(x)|\).
Hornik proved the following generalization of Cybenko’s result.
In the above Theorem, the space \(L^p(\mu)\) is the space of functions \(f\) with \(\int |f|^p d\mu < \infty\) with metric \(d(f,g) = \left( \int |f-g|^p d\mu \right)^{1/p}\). To begin the proof, we need a crash course in functional analysis.
Why is this theorem useful in our context? Our proof of Cybenko and Hornik’s results is a proof by contradiction using the Hahn-Banach extension theorem. We consider the subspace \(U\) given by \(\{ \sum \alpha_j \sigma(w_j^T x + b_j)\}\), and we assume for contradiction that \(\overline{U}\) is not the entire space of functions. We conclude that there exists a continuous linear map \(L\) on our function space that restricts to \(0\) on \(\overline{U}\) but is not identically zero. In other words, to prove the desired result, it suffices to show that any continuous linear map \(L\) that is zero on \(U\) must be the zero map.
Now, classical results in Functional analysis state that a continuous linear functional \(L\) on \(L^p(\mu)\) can be expressed as
\[
L(f) = \int f g \,d\mu
\]
for some \(g \in L^q(\mu)\), where \(\frac{1}{p} + \frac{1}{q} = 1\). A continuous linear functional \(L\) on \(C(X)\) can be expressed as
\[
L(f) = \int f \,d\mu(x),
\]
where \(\mu\) is a finite signed measure supported on \(X\).
We can find similar expressions for linear functionals on the other spaces considered in the theorems of Cybenko and Hornik.
Before going into the general proof, consider the (easy) example in which our function space is \(L^p(\mu)\) and \(\sigma(x) = \mathbf{1}(x \geq 0)\). How do we show that, if \(L(f) = 0\) for all \(f\) in the set defined in the theorem, then the function \(g \in L^q(\mu)\) associated to \(L\) must be identically zero? By translation, we obtain from \(\sigma\) the indicator of any interval, i.e., we can show that \(\int_a^b g \, d\mu = 0\) for any \(a < b\). Since \(\mu\) is finite (\(\sigma\)-finiteness is enough here), \(g\) must be zero, as desired.

Using this example as inspiration, we now consider the general case in the setting of Cybenko's theorem. We want to show that
\[
\int_{\mathbb{R}^k} \sigma(w^T x + b) \, d\mu(x) = 0 \quad \forall w, b
\]
implies that \(\mu = 0\). First, we reduce to dimension \(1\) using the following Fourier analysis trick: defining the measures \(\mu_a\) as
\[
\mu_a(B) = \mu(x : a^T x \in B),
\]
we observe that
\[
\int_{\mathbb{R}} \sigma(wt + b) \, d\mu_a(t) = 0.
\]
Moreover, if we can show that \(\mu_a \equiv 0\) for each \(a\), then \(\mu \equiv 0\) (``a measure is defined by all of its projections''), since then
\[
\hat{\mu}(a) = \int_{\mathbb{R}^k} \exp(i a^T x) \, d\mu(x) = \int_{\mathbb{R}} \exp(it) \, d\mu_a(t) = 0 \Rightarrow \hat{\mu} = 0 \Rightarrow \mu = 0.
\]
(Note that we used the finiteness of \(\mu\) here.)

Having reduced to dimension \(1\), we employ another very useful trick (that also uses the finiteness of \(\mu\)): the convolution trick. By convolving \(\mu\) with a small Gaussian, we obtain a measure that has a density, letting us work with Lebesgue measure. We now sketch the remainder of the proof. By our convolution trick, we have
\begin{equation}
\int_{\mathbb{R}} \sigma(wt + b) h(t) \, dt = 0 \quad \forall w, b
\label{con}
\end{equation}
and want to show that the density \(h = 0\).
Changing variables, we rewrite the condition (\ref{con}) as \[ \int_{\mathbb{R}} \sigma(t) h(wt+b) \, dt = 0 \quad \forall w \neq 0, b. \] To show that \(h=0\), we use the following tool from abstract Fourier analysis. Let \(I\) be the closure of the linear space spanned by all the \(h(wt+b)\). Since \(I\) is invariant under shifts, it is invariant under convolution; in the language of abstract Fourier analysis, \(I\) is an ideal with respect to convolution. Let \(Z(I)\) denote the set of all \(\omega\) at which the Fourier transforms of all functions in \(I\) vanish; then \(Z(I)\) is either \(\mathbb{R}\) or \(\{0\}\), since if \(g(t)\) is in the ideal then so is \(g(wt)\) for \(w \neq 0\). If \(Z(I) = \mathbb{R}\) then all the functions in the ideal are identically \(0\), as needed. Otherwise \(Z(I) = \{0\}\), and then, by Fourier analysis, \(I\) is the set of all functions with \(\hat{f}(0) = 0\), i.e., all mean-zero functions. But if \(\sigma\) is orthogonal to all mean-zero functions, then \(\sigma\) must be constant, contradicting our assumptions on \(\sigma\). We conclude that \(Z(I) = \mathbb{R}\), i.e., \(h = 0\), completing the proof.