Mathematics of Deep Learning: Lecture 6 – Simple hierarchical models


Transcribed by Govind Ramnarayan (edited by Mathew Brennan, Asad Lodhia and Elchanan Mossel)

Some Very Simple (Gaussian) Hierarchical Models

We start by describing some very simple hierarchical models on the \(d\)-ary tree . Recall that in the \(d\)-ary tree, each node in the tree has exactly \(d\) children (As opposed to the \(d\)-regular tree, in which the root has \(d\) children, and every other node has \(d-1\) children). Both models will feature an unknown character at the root, chosen according to some distribution. Each node propagates its character to its children with some Gaussian noise. The goal will be to recover the character at the root given the characters at some level \(\ell\).

Version 1 (Brownian Motion)

We will denote the character at the vertex \(v\) by \(X_v\), with the root being denoted as 0.

\(X_0\) will be distributed according to an unknown distribution \(\mu\), with
\[
X_v = X_0 + \sum_{w \in \text{path}(0,v) \backslash \{0\}} \sigma \mathcal{N}_w
\]
where \(\mathcal{N}_w\) are i.i.d. \(N(0,1)\) random variables, and \(\sigma\) is a fixed parameter.

The goal is to estimate \(X_0\) given \(\left\{ X_{v}: v \in L_{\ell}\right\}\), the set of labels at level \(\ell\). However, this model will not be the one we focus on today.

Version 2 (Ornstein-Uhlenbeck)

This is similar to the above model, but instead \(X_0 \sim N(0,1)\) and if \(v\) is a parent of \(w\) then
\[
X_w = \eta X_v + \sqrt{1 - \eta^2} \mathcal{N}_w
\]
where \(\mathcal{N}_w\) are i.i.d \(N(0,1)\).

An important observation about this model is that \(X_w \sim N(0,1)\) for all nodes \(w\) .

Just like in the previous model, we want to estimate \(X_0\) given \(\left\{X_v: v \in L_{\ell}\right\}\). The question is, how should we do it? A natural attempt is to take the average of the children, which we now analyze.
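To make the model concrete, here is a minimal NumPy sketch (ours, not from the lecture; the function name and all parameter values are illustrative) that samples the Ornstein-Uhlenbeck broadcast process level by level and estimates, by Monte Carlo, the best mean-squared error achievable by a linear estimator of the form \(a\sum_v X_v\); since \(\mathrm{Var}(X_0)=1\), that optimal error equals \(1-\mathrm{Corr}(X_0,\sum_v X_v)^2\).

```python
import numpy as np

def sample_ou_tree(d, ell, eta, rng):
    """Sample the Ornstein-Uhlenbeck broadcast model on the d-ary tree.
    Returns (X_0, leaves), where leaves holds the d**ell values at level ell."""
    x0 = rng.standard_normal()
    level = np.array([x0])
    for _ in range(ell):
        parents = np.repeat(level, d)                      # each node gets d children
        noise = rng.standard_normal(parents.size)
        level = eta * parents + np.sqrt(1 - eta**2) * noise
    return x0, level

rng = np.random.default_rng(0)
d, ell, eta = 3, 6, 0.7                                    # here d * eta**2 = 1.47 > 1
roots, sums = np.empty(2000), np.empty(2000)
for t in range(2000):
    x0, leaves = sample_ou_tree(d, ell, eta, rng)
    roots[t], sums[t] = x0, leaves.sum()
rho = np.corrcoef(roots, sums)[0, 1]
# the best linear estimator a * sum(X_v) has MSE 1 - rho^2; noticeably below 1 here
print("estimated optimal linear MSE:", 1 - rho**2)
```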

Note that \((X_0, X_v: v \in L_\ell)\) is jointly Gaussian. This means that the estimate that minimizes the \(\ell_2\) error
\[
\mathbb{E}\left[X_0 \Big|(X_v)_{v \in L_{\ell}}\right]
\]
can be written as \(\sum_{v \in L_{\ell}} a_v X_v\), since for jointly Gaussian vectors the conditional expectation of one coordinate given the others is a linear function of the conditioning variables. Also, by symmetry of the model, all the \(a_v\)’s are the same, so we can write this as \(a \sum_{v \in L_{\ell}} X_v\).

So to find the correct estimate for \(X_0\), it suffices to find the value of \(a\) that minimizes
\[
\mathbb{E}\left[\left(X_0 - a \sum X_v\right)^2\right]
\]
which we now proceed to do. We begin by computing the mean squared error
\begin{align*}
\mathbb{E}\left[\left(X_0 - a \sum X_v\right)^2\right] &= \mathbb{E}[X_0^2] + a^2\sum \mathbb{E}[X_v^2] - 2ad^{\ell}\text{Cov}(X_0, X_v) + a^2 \sum_{v \neq w \in L_{\ell}} \text{Cov}(X_v, X_w) \\
&= 1 + d^{\ell}a^2 - 2ad^{\ell}\eta^{\ell} + a^2 d^{\ell} \sum\limits_{j=1}^{\ell} \eta^{2j} (d-1)d^{j-1} \\
&= 1 + d^{\ell}\left(-2a\eta^{\ell} + \frac{a^2(d-1)}{d} \cdot
\frac{(d \eta^2)^{\ell + 1} - 1}{(d\eta^2) - 1} \right)
\end{align*}
The first equality follows from expanding the square. The second uses the fact that there are \(d^{\ell}\) nodes at level \(\ell\), together with the fact that the covariance of \(X_v\) and \(X_w\) depends only on the distance between them in the underlying tree: if that distance is \(D\), then \(\mathbb{E}[X_v X_w] = \eta^D\). Since all the vertices in the final sum are at level \(\ell\), they are at an even distance from each other (the distance is \(2j\), where \(j\) is the distance from each vertex to their lowest common ancestor), and for each \(v\) there are exactly \((d-1)d^{j-1}\) vertices \(w\) whose lowest common ancestor with \(v\) lies at distance \(j\). The final equality follows from the formula for the sum of a geometric series.

We can compute the value of \(a\) that minimizes
\begin{equation}
1 + d^{\ell} \left(-2a\eta^{\ell} + \frac{a^2(d-1)}{d} \cdot \frac{(d \eta^2)^{\ell + 1} - 1}{(d\eta^2) - 1} \right)
\end{equation}
by taking the derivative with respect to \(a\) (this can also be accomplished approximately by setting the two terms in the sum to be equal and solving for \(a\)). Doing so, we get that
\begin{equation}
a = \eta^{\ell} \left(\frac{d-1}{d} \cdot \frac{(d\eta^2)^{\ell}}{d\eta^2 - 1} \right)^{-1}
\end{equation}
Now we realize that the optimal squared error \( \mathbb{E}\left[\left(X_0 - a \sum X_v\right)^2\right] \) at this value of \( a \), as given by the computation above, behaves very differently when \(d \eta^2 < 1\) versus when \(d \eta^2 > 1\). When \(d \eta^2 < 1\), we notice that \(\frac{(d \eta^2)^{\ell + 1} - 1}{(d\eta^2) - 1}\) is approximately a constant. Furthermore, the optimal value of \(a\) is roughly \(\eta^{\ell}\), and so our expression for the optimal squared error becomes
\[
\mathbb{E}\left[\left(X_0 - a \sum X_v\right)^2\right] \approx 1 + d^{\ell}a(-2\eta^{\ell} + a \Theta(1)) = 1 - \Theta\left((d\eta^2)^{\ell}\right) \approx 1
\]
Note also that a squared error of 1 would be accomplished by simply ignoring the leaves and estimating \(X_0\) by 0.

By contrast, when \(d\eta^2 > 1\), we argue that the optimal squared error is less than 1. Plugging in the minimizing value of \(a\) gives that
\begin{align*}
\mathbb{E}\left[\left(X_0 - a \sum X_v\right)^2\right] &= 1 + d^{\ell} \left(-2a\eta^{\ell} + \frac{a^2(d-1)}{d} \cdot \frac{(d \eta^2)^{\ell + 1} - 1}{(d\eta^2) - 1} \right) \\
&= 1 + d^{\ell}(-2a\eta^{\ell} + a\eta^{\ell}) \\
&= 1 - d^{\ell}\eta^{\ell} \left( \eta^{\ell} \frac{d}{d-1} \frac{d\eta^2 - 1}{(d\eta^2)^{\ell + 1} - 1} \right) \\
&\leq 1 - \frac{(d\eta^2)^{\ell}}{(d\eta^2)^{\ell + 1}} \cdot \frac{d}{d-1} \cdot (d\eta^2 - 1) \\
&< 1 \end{align*} In summary, we have a threshold at \(d\eta^2 = 1\) for this simple model. If \(d\eta^2 > 1\), we have shown that we can estimate better than random with the optimal linear estimator. When \(d\eta^2 < 1\), we know that we cannot do better than random. But this model was a linear Gaussian model, which is not at all similar to the type of structure posed by deep learning (specifically, we know that deep learning has non-linearities that are applied at each layer, whereas the entire model here was linear). We’ll introduce some non-linearities in the next model to try to make it more interesting, and analyze the resulting behaviour.
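The threshold is easy to see numerically. The sketch below (our own illustration, with arbitrary parameter choices) plugs the exact minimizer over \(a\) into the displayed expression for the squared error and prints the resulting optimum on both sides of \(d\eta^2 = 1\); for large \(\ell\) it is essentially \(1\) below the threshold and bounded away from \(1\) above it.

```python
import numpy as np

def optimal_linear_mse(d, eta, ell):
    """Minimize 1 + d^ell * (-2*a*eta^ell + a^2 * g) over a, where
    g = (d-1)/d * ((d*eta^2)^(ell+1) - 1) / (d*eta^2 - 1)."""
    r = d * eta**2                                   # assumes r != 1
    g = (d - 1) / d * (r**(ell + 1) - 1) / (r - 1)
    a_star = eta**ell / g                            # exact minimizer of the quadratic in a
    return 1 + d**ell * (-2 * a_star * eta**ell + a_star**2 * g)

d, ell = 3, 30
for eta in [0.3, 0.5, 0.56, 0.6, 0.7, 0.9]:          # threshold at eta = 1/sqrt(3) ~ 0.577
    print(f"eta={eta:.2f}  d*eta^2={d*eta**2:.2f}  "
          f"optimal MSE={optimal_linear_mse(d, eta, ell):.4f}")
```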

Model 3: “Threshold” Version of Model 2

In this model we deal with the same \(d\)-ary tree as before, but the initial distribution of \(X_0\) and the noise model we use will change.

Specifically, \(X_0\) will be distributed uniformly in \([q] = \{ 1,\ldots, q\}\) and if \(w\) is the parent of \(v\), then
\[
X_v = \begin{cases}
X_w & \text{ w.p. $\eta$} \\
\text{Unif}[q] & \text{ w.p. } 1 - \eta
\end{cases}
\]

An alternative view of the model is that each edge in the \(d\)-ary tree is erased with probability \(1 - \eta\). Then, the root of each component gets a random symbol from \([q]\), and within each component the character is copied. So the labels within a component are the same, and the labels across components are independent.
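This percolation description translates directly into a sampler. The following sketch (ours, in NumPy, with illustrative parameter values) generates the root symbol and the labels at level \(\ell\) by deciding, edge by edge, whether the parent's symbol is copied or replaced by a fresh uniform symbol.

```python
import numpy as np

def broadcast_leaves(d, ell, eta, q, rng):
    """Sample Model 3: copy along each edge w.p. eta; otherwise the child starts a
    new component and receives a fresh uniform symbol from {1, ..., q}."""
    root = int(rng.integers(1, q + 1))
    level = np.full(1, root)
    for _ in range(ell):
        parents = np.repeat(level, d)
        copied = rng.random(parents.size) < eta
        fresh = rng.integers(1, q + 1, size=parents.size)
        level = np.where(copied, parents, fresh)
    return root, level

rng = np.random.default_rng(1)
root, leaves = broadcast_leaves(d=3, ell=8, eta=0.7, q=3, rng=rng)
print("root:", root,
      " empirical frequency of the root symbol at the leaves:", float(np.mean(leaves == root)),
      " (compare to 1/q =", round(1 / 3, 3), ")")
```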

Our goal is as follows.

Goal: Estimate \(X_0\) from \(\left(X_v: v \in L_{\ell}\right)\)

Before continuing, we note that the error of any estimator is always at least \(\Theta(1)\) (where we are thinking of \(d\) and \(\eta\) as constants), since there is a chance that there is no copying at the first level! But this was also a problem for Model 2.

A First Attempt (Plurality): One naive estimator is to take the value \(a \in [q]\) that is most common among the \(\left(X_v : v \in L_{\ell}\right)\) (i.e., taking the plurality vote amongst the leaves at level \(\ell\)). This is one of the most natural attempts, but is this the optimal estimator? We now analyse (a variant of) this estimator.

We’ll analyse a variant of the plurality estimator. Let
\[
y_v =
\begin{cases}
1 - 1/q & \text{ if } X_v = 1 \\
-1/q & \text{ if } X_v \neq 1
\end{cases}
\]
The estimator we will use is the following:

  1. If \(S := \sum_{v \in L_\ell} y_v > 0\), then output 1.
  2. Otherwise, output a random character from \(\{ 2,\ldots, q\}\).
The estimator described above outperforms a random estimator if and only if plurality does.

We won’t formally prove this claim now; it will follow trivially from the results in the rest of the lecture. However, intuitively \(S\) synchronizes with the plurality estimator when the root symbol is 1, and is random otherwise. If plurality does well, then by symmetry it should do well when the root symbol is 1, and it is clear that in this case \(S\) succeeds whenever plurality succeeds, so it outperforms the random estimator. Conversely, if \(S\) works correctly, it means that 1 appeared in more than a \(\frac{1}{q}\) fraction of the leaves very frequently. The only way plurality could work poorly in this case is if the plurality element is not 1 on many of these occasions. However, by symmetry, every non-root element has lower probability of showing up at the leaves than the root element, so this can only happen if, on many of those occasions, 1 showed up in just a little more than a \(\frac{1}{q}\) fraction of the leaves but was beaten by a non-root symbol due to random deviation. Intuitively, though, in that case 1 would not have shown up in more than a \(\frac{1}{q}\) fraction so often, because random deviation would often push it below \(\frac{1}{q}\). We will indeed see that this is the case in the remainder of the lecture.
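To make the comparison concrete, here is a small Monte Carlo sketch (ours) that runs both the \(S\)-based estimator and plurality on leaves produced by the broadcast_leaves sampler from the earlier sketch; with the illustrative parameters below (\(d\eta^2 > 1\)) both should beat the random-guessing baseline of \(1/q\).

```python
import numpy as np
rng = np.random.default_rng(2)

def estimate_S(leaves, q):
    """Output 1 if S = sum_v y_v > 0, otherwise a random symbol from {2, ..., q}."""
    S = np.sum(np.where(leaves == 1, 1 - 1 / q, -1 / q))
    return 1 if S > 0 else int(rng.integers(2, q + 1))

def estimate_plurality(leaves, q):
    """Output the most common symbol among the leaves."""
    return int(np.argmax(np.bincount(leaves, minlength=q + 1)[1:]) + 1)

d, ell, eta, q, trials = 3, 8, 0.7, 3, 2000
hits_S = hits_plur = 0
for _ in range(trials):
    root, leaves = broadcast_leaves(d, ell, eta, q, rng)   # sampler from the earlier sketch
    hits_S += (estimate_S(leaves, q) == root)
    hits_plur += (estimate_plurality(leaves, q) == root)
print(f"random: {1/q:.3f}   S-estimator: {hits_S/trials:.3f}   plurality: {hits_plur/trials:.3f}")
```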

First we note a trivial claim about when we cannot hope to recover \(X_0\) better than random asymptotically.

[Trivial Lower Bound] If \(d \eta < 1\) no estimator can do better than random asymptotically.
If \(d \eta < 1\), then the expected number of leaves that are connected to the root is \((d\eta)^{\ell}\), which tends to zero. Therefore, with high probability no leaf is connected to the root, and no correlation between the leaves and the root survives.

How do we analyse this estimator? We’ll use the following three facts:
\[
\mathbb{E}\left[S | X_0 = 1\right] = \mathbb{E}\left[\#\{v \in L_{\ell} \text{ connected to the root}\}\right] \cdot \left(1 - \frac{1}{q}\right) = (d \eta)^{\ell} \left(1 - \frac{1}{q}\right)
\]
\[
\mathbb{E}\left[S | X_0 \sim U\right] = 0
\]
which follows from the fact that each \(y_v\) has mean zero when \(X_v\) is uniform on \([q]\). Here \( U \) denotes the uniform distribution on \([q]\). Also,
\[
\text{Var}(S) = \Theta\left(d^{\ell} \max\left(1, (d\eta^2)^{\ell}\right)\right)
\]
which follows from a direct computation very similar to the calculation in the previous model, where we minimized the \(\ell_2\)-error.

From the above three observations, we can conclude that there is some \(\epsilon > 0\) such that
\[
d \eta^2 > 1 \Rightarrow \frac{\mathbb{E}\left[S | X_0 = 1\right] - \mathbb{E}\left[S | X_0 \sim U\right]}{\sqrt{\text{Var}(S)}} > \epsilon
\]
and furthermore that
\[ d\eta^2 < 1 \Rightarrow \frac{\mathbb{E}\left[S | X_0 = 1\right] - \mathbb{E}\left[S | X_0 \sim U\right]}{\sqrt{\text{Var}(S)}} \to 0 \] As we will see, in the case where \(d \eta^2 > 1\), we will be able to recover the root better than random by using the estimator \(S\).

[Second Moment Argument]
If \(\frac{\mathbb{E}\left[S | X_0 = 1\right] - \mathbb{E}\left[S | X_0 \sim U\right]}{\sqrt{\text{Var}(S)}} > \epsilon\), then we can recover \(X_0\) using \(S\) better than random.
Let \(\mu\) denote the measure of \(S\) given a random character at the root, and let \(\mu^+\) denote the measure of \(S\) given that the root is 1. Furthermore, define \(f\) such that \(d\mu^+ = f d\mu\), and of course by definition \(\mu = 1d\mu\). Then
\begin{align*}
\left(\mathbb{E}_{\mu^+}[S] - \mathbb{E}_{\mu}[S]\right)^2 &= \left( \int S (d\mu^+ - d\mu) \right)^2 \\
&= \mathbb{E}_{\mu}\left[S(f-1)\right]^2 \\
&\leq \mathbb{E}_{\mu}\left[S^2\right] \cdot \mathbb{E}_{\mu}\left[(f-1)^2\right]
\end{align*}
by Cauchy-Schwarz. Furthermore, we can upper bound the expression \(\mathbb{E}_{\mu}\left[(f-1)^2\right]\) by noting that \(\mathbb{E}_{\mu}\left[(f-1)^2\right] \leq d_{\text{TV}}(\mu, \mu^+)\), which follows since \((\mu(v) - \mu^+(v))^2 \leq \left|\mu(v) - \mu^+(v)\right|\) for all \(v\). Finally, we can use this to lower bound the total variation distance between \(\mu\) and \(\mu^+\),
\[
d_{\text{TV}}(\mu, \mu^+) \geq \frac{\left(\int S (d\mu -d\mu^+)\right)^2}{\mathbb{E}_{\mu}\left[S^2\right]} \geq \epsilon^2
\]
where in the final step we used our assumption.

In conclusion, this estimator exhibits exactly the same threshold behaviour at \(d\eta^2 = 1\) as the “boring” linear Model 2! But wait:

  1. We didn’t prove that the estimator fails if \(d \eta^2 < 1\).
  2. Maybe there’s a better estimator. Maybe even one that works for \(d \eta > 1\)?

These questions are addressed by the following two theorems.

  1. If \(q=2\), then \(d\eta^2 = 1\) is the threshold (and \(q=3,4\) behave similarly).
  2. If \(q \geq 5\), then \(d\eta^2 = 1\) is not the threshold (but we have no simple formula for the threshold!)

In the theorem below, we define count reconstruction estimators to be estimators that depend only on the number of occurrences of each of the \(q\) colors among the symbols at the leaves of the tree.

Count reconstruction estimators will always fail for \(d \eta^2 < 1\).

In particular, \(d\eta^2 = 1\) is the threshold for all sane count reconstruction estimators, like the one we have already described.

In the interest of time, we won’t actually prove both of these theorems. We will sketch some easy versions of a “proof” of the first of these two theorems.

[“Proof” that \(d\eta^2\) is not threshold for \(q \geq 5 \)]

Baby version (\(q = \infty\)): We first handle this easier special case. When \( q = \infty \), the labels in the model can be constructed as follows:

  1. Color root randomly (with some arbitrary color).
  2. Copy the color with probability \(\eta\).
  3. Otherwise, choose a completely new color that has not been seen previously and use it.

Now consider the estimator defined as follows: If two nodes \(u, v \in L_{\ell}\) have the root as their lowest common ancestor and \(x_u = x_v\), then return \(x_v\), and otherwise return an arbitrary color.

  1. Correctness: Any repeated color must have been given by a common ancestor in this model. For \(u\) and \(v\) as chosen, this ancestor must have been the root!
  2. Probability of correctness: The probability that we get such leaves is at least the probability that two children of the root survive, and that the branching process survives for each of them.
    \[
    \mathbb{P}[\text{correct}] \geq \eta^2 \cdot \mathbb{P}[\text{branching process survives}]^2 \geq \epsilon
    \]
    where we used the fact that the branching process survives forever with positive probability when \(d \eta > 1\).
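Here is a toy implementation of this \(q = \infty\) estimator (ours; the function names and parameter values are illustrative). Colors are integers, with 0 reserved for the root; a color repeated across two different subtrees of the root must be the root's color.

```python
import numpy as np
rng = np.random.default_rng(2)

def infinite_q_leaves(d, ell, eta):
    """q = infinity: copy the parent's color w.p. eta, otherwise mint a brand-new color."""
    next_color = 1                                     # color 0 is reserved for the root
    level = np.array([0])
    for _ in range(ell):
        parents = np.repeat(level, d)
        fresh = rng.random(parents.size) >= eta        # these edges were "erased"
        colors = parents.copy()
        colors[fresh] = np.arange(next_color, next_color + fresh.sum())
        next_color += int(fresh.sum())
        level = colors
    return level

def estimate_root(leaves, d, ell):
    """If two leaves lying in different subtrees of the root share a color, return it."""
    subtree = np.arange(leaves.size) // (d ** (ell - 1))   # which child of the root is above each leaf
    first_seen = {}
    for c, s in zip(leaves.tolist(), subtree.tolist()):
        if c in first_seen and first_seen[c] != s:
            return c
        first_seen.setdefault(c, s)
    return -1                                              # arbitrary color

d, ell, eta = 3, 7, 0.6        # d * eta > 1: the root cluster survives with positive probability
trials = 500
hits = sum(estimate_root(infinite_q_leaves(d, ell, eta), d, ell) == 0 for _ in range(trials))
print("P[correct] ~", hits / trials)
```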

How do we extend this to large, finite \(q\)? First, we make a definition that will be relevant in our extension.

An \(r\)-diluted \(d\)-ary tree is a tree in which every node at level \(j\) has at least \(d\) descendants at level \(j + r\) (if that level exists).

The extension to large, finite \(q\) will result from the following two claims:

If \(d \eta > 1\), there exist \(r, \epsilon > 0\) such that with probability at least \(\epsilon\), the root cluster contains an infinite \(r\)-diluted binary tree.
Given \(d, r, \epsilon\), there exists a value \(\eta' < 1\) such that, for the branching process with copy probability \(\eta'\), the probability that the root cluster contains an infinite \(r\)-diluted \((d^{r}-1)\)-ary tree is at least \(1 - \frac{\epsilon}{10q}\).

We will delay proving these two claims for now. Instead, we proceed to give a good estimator for the case where \(q\) is large and finite, given these claims.

The Estimator: Look for a monochromatic, \(r\)-diluted binary tree of color \(i\). If it exists, return color \(i\).

By the first claim above, if the root is colored with \(i\), then the probability that such a tree exists with color \(i\) is \(\geq \epsilon\). Now suppose that the root is not colored with \(i\). Then we argue that there is no \(r\)-diluted binary tree of color \(i\) with very high probability. Let \(v\) be a child of a node \(w\) in the tree. First we note that
\begin{align*}
\Pr[x_v \neq i | x_w \neq i] &= \eta + (1 - \eta)\left(1 - \frac{1}{q}\right) \\
&= 1 – \frac{1}{q} + \frac{\eta}{q} \underset{q \to
\infty}{\longrightarrow} 1
\end{align*}
By taking \(q\) large enough such that \(1 - \frac{1}{q} + \frac{\eta}{q} \geq \eta'\), where \(\eta'\) is from the second claim above, we see that we cannot fit an \(r\)-diluted binary tree of color \(i\) in the remainder of the tree: with high probability the nodes not colored \(i\) already contain an \(r\)-diluted \((d^{r}-1)\)-ary tree, which at every \(r\) levels takes up \(d^{r}-1\) of the \(d^{r}\) descendants, leaving no space for the 2 descendants that a diluted binary tree of color \(i\) would need.

Now we’ll prove the second claim. Before we proceed, we note that this proof was given in detail at the beginning of the following lecture, so the proof here reflects the proof done in that lecture.

[Proof of Second Claim]
We proceed by induction on \(k\), the number of iterations. Let \(p_k\) denote the probability that the Claim holds for a tree of depth \(k \cdot r\) (that is, that the root cluster contains an \(r\)-diluted \((d^{r}-1)\)-ary tree of depth \(k \cdot r\) after percolation has occurred for \(k \cdot r\) levels). We will use induction to show that, if \(\eta'\) is large enough, then there is some \(p^* \geq 1 - \epsilon\) such that \(p_k \geq p^*\) for all \(k\).

Define the function \(f\) as follows
\[ f(p) = \Pr[\text{Bin}(d^{r}, p) \geq d^{r} - 1]\]

We can see that \(f(1) = 1\), and furthermore that \(f\) is monotone in \(p\). Furthermore, we can compute \(f(p)\) by simply counting.
\[ f(p) = p^{d^{r}} + d^{r}(1 - p)p^{d^{r} - 1}\]
Now we compute the derivative of \(f\):
\[ f'(p) = d^{r}(d^{r}-1)(1 - p)p^{d^{r} - 2}, \qquad \text{so that } f'(1) = 0.\]
From this, we can conclude that for all \(p^* < 1\) sufficiently close to \(1\), we have that
\[ f(p^*) \geq p^* \]
We choose a \(p^* > 1 - \epsilon\) and \(\eta' < 1\) such that
\[ Q := \Pr[\text{connected to all } d^{r} \text{ descendants at distance } r] \geq \frac{p^*}{f(p^*)} \]
(we write \(Q\) rather than \(q\) to avoid a clash with the alphabet size). Finally, we can proceed to show that \(p_k \geq p^*\) for all \(k\) by induction. We note that \(p_0 = 1\), as the \(0\)-depth tree is just the root. Furthermore
\begin{align*}
p_{k+1} &\geq Q \cdot \Pr[\text{Bin}(d^{r}, p_k) \geq d^{r} - 1] \\
&\geq \frac{p^*}{f(p^*)} \cdot f(p_k) \\
&\geq \frac{p^*}{f(p^*)} f(p^*) = p^*
\end{align*}
where the final inequality follows from the induction hypothesis that \(p_k \geq p^*\) and the monotonicity of \(f\) in \(p\).
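The fixed-point argument is easy to check numerically. In the sketch below (ours; the values of \(d^r\), \(p^*\) and \(Q\) are arbitrary illustrations, not taken from the lecture), we iterate \(p_{k+1} = Q\, f(p_k)\) starting from \(p_0 = 1\) and confirm that the sequence never drops below \(p^*\) once \(Q \geq p^*/f(p^*)\).

```python
def f(p, dr):
    """f(p) = P[Bin(dr, p) >= dr - 1] = p^dr + dr * (1 - p) * p^(dr - 1)."""
    return p**dr + dr * (1 - p) * p**(dr - 1)

dr = 9                                  # playing the role of d^r (e.g. d = 3, r = 2)
p_star = 0.99                           # a target p* close to 1, where f(p*) >= p*
Q = 1.0005 * p_star / f(p_star, dr)     # any Q >= p*/f(p*) works; take a slightly larger value
p = 1.0
for k in range(1, 31):
    p = Q * f(p, dr)
    assert p >= p_star, (k, p)
print("after 30 iterations p_k =", round(p, 6), ">= p* =", p_star)
```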

For the rest of the lecture, we will sketch the proof of the theorem that count reconstruction estimators will always fail for \(d \eta^2 < 1\) (the word sketch should be re-emphasized here). Informally, this says that if \(d \eta^2 < 1\), then we cannot reconstruct the character at the root better than random simply by looking at the count statistics on the leaves.

[Kesten-Stigum Theorem] If \(d \eta^2 < 1\), then the count vector \((c_1^{\ell}, \ldots, c_q^{\ell})\) satisfies a CLT that does not depend on the root, where \(c_i^{\ell}\) denotes the number of leaves at level \(\ell\) with color \(i\).

As a concrete example, we can use this CLT for the counts to give a CLT for our earlier count estimator \(S\). Let \(\psi(i) = \mathbf{1}_{\{i = 1\}} - \frac{1}{q}\). Then we can see that \(y_v = \psi(x_v)\), and we can apply the Kesten-Stigum Theorem to \(S_{\ell} = \sum_{v \in L_{\ell}} y_v\).

We now give a high level sketch of why the Kesten-Stigum theorem works for this sum. The idea is to use the right martingale. We consider
exposing one node of the tree at a time, and subtracting the “expected value” of the node given its parent to keep the exposed variables zero mean. More formally, define
\[ S_{\ell}' = \sum_{v: |v| \leq \ell} (d \eta)^{-|v|}(y_v - \eta y_{\text{parent}(v)})\]
where \(|v|\) denotes the level of the vertex \(v\) in the tree (and we set \(y_{\text{parent}(\text{root})} = 0\)). Note that \(\mathbb{E}\left[y_v | y_{\text{parent}(v)}\right] = \eta \cdot y_{\text{parent}(v)}\), so whenever we add a vertex to the sum, the expected value of the sum remains the same. By induction, it follows that when we add an entire level to the sum, it remains the same. Hence, we can conclude that

  1. \(S_{\ell}'\) is a martingale.
  2. \(S_{\ell}' = \sum_{v: |v| = \ell} (d \eta)^{- \ell} y_v\)

where the second item follows by induction and noting that the sum telescopes on adjacent levels. We know that \(S_0' = \sum_{v: |v| = 0} (d \eta)^{0} y_v = y_0\), so the claim holds for \(\ell = 0\). We now assume that \(S_{\ell}' = \sum_{v: |v| = \ell} (d \eta)^{- \ell} y_v\) and prove the statement for \(\ell + 1\).
\begin{align*}
S_{\ell + 1}' &= \sum_{v: |v| \leq \ell+1} (d \eta)^{-|v|}(y_v - \eta y_{\text{parent}(v)}) \\
&= S_{\ell}' + \sum_{v: |v| = \ell + 1}(d \eta)^{- \ell - 1}y_v - \sum_{v: |v| = \ell + 1} (d \eta)^{-\ell - 1} \eta y_{\text{parent}(v)} \\
&= S_{\ell}' + \sum_{v: |v| = \ell + 1}(d \eta)^{- \ell - 1}y_v - \sum_{v: |v| = \ell} (d \eta)^{-\ell} y_v \\
&= \sum_{v: |v| = \ell + 1}(d \eta)^{- \ell - 1}y_v
\end{align*}
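A quick numerical sanity check of this telescoping identity (our own sketch, reusing the level-by-level sampler idea from Model 3; all parameter values are arbitrary):

```python
import numpy as np
rng = np.random.default_rng(3)

d, ell, eta, q = 2, 6, 0.8, 3

# sample the broadcast chain level by level, keeping every level
levels = [np.array([rng.integers(1, q + 1)])]
for _ in range(ell):
    parents = np.repeat(levels[-1], d)
    copied = rng.random(parents.size) < eta
    fresh = rng.integers(1, q + 1, size=parents.size)
    levels.append(np.where(copied, parents, fresh))

psi = lambda x: (x == 1).astype(float) - 1.0 / q          # y_v = psi(X_v)

# definition of S'_ell: sum over all vertices of (d*eta)^{-|v|} (y_v - eta * y_parent(v))
S_def = psi(levels[0])[0]                                 # the |v| = 0 term (y_parent(root) = 0)
for k in range(1, ell + 1):
    y_parent = np.repeat(psi(levels[k - 1]), d)
    S_def += np.sum((d * eta) ** (-k) * (psi(levels[k]) - eta * y_parent))

# telescoped form: a sum over the level-ell vertices only
S_tel = np.sum((d * eta) ** (-ell) * psi(levels[ell]))
print(S_def, S_tel)                                       # equal up to floating-point error
```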
Now, we would like to appeal to a Martingale Central Limit Theorem to establish that \(S_{\ell}'\) satisfies a CLT that is independent of the root. Note that Azuma’s inequality would tell us that \(\lim_{\ell \to \infty} S_{\ell}'\) is highly concentrated around \(S_0'\) if we knew that \(\sum\limits_{k=0}^{\infty} c_k^2 < \infty\), where the \(c_k\) denote the martingale differences. In that case, the leaves would not satisfy a CLT independent of the root. So we lower bound the sum of squared differences:
\begin{align*}
\sum_{k \leq \ell}(S_{k}' - S_{k - 1}')^2
&\geq \sum_{k \leq \ell} d^{k}(d \eta)^{- 2 k} \Theta(1) \\
&= \Theta(1) \sum_{k \leq \ell} \left(\frac{1}{d\eta^2}\right)^{k}
\end{align*}
If \(d \eta^2 < 1\), this series diverges, which (via the martingale CLT) means that \(S_{\ell}'\), and hence the count vector, satisfies a CLT independent of the root. However, this is not quite enough; a CLT tells us a statement of the form \(\mathbb{P}[\sum X_i \in [a - c\sqrt{n}, a + c\sqrt{n}]]\), but just knowing that the counts are concentrated in an interval does not rule out the possibility of an estimator that can reconstruct the root from the parity of \(c_1^{\ell}\) (or similar functions that are not constrained by being in an interval). We note that it does not really make sense for any sane estimator to use the parity when trying to reconstruct the root, but we still have to rule it out! So we will additionally need a Local Central Limit Theorem, of the form

[Local CLT] Let \(X_i\) be iid integer valued and not supported on an arithmetic progression with stride \(\geq 2\). Then we have that
\[ \Pr\left[\sum\limits_{i=1}^n X_i = m\right] = \frac{1}{\sigma \sqrt{2 \pi n}} \exp\left(-\frac{(m - n\mu)^2}{2 \sigma^2 n}\right) + o(1/\sqrt{n})\]
where \(\sigma^2\) is the variance of \(X_i\) and \(\mu\) is the expectation.

The local CLT helps us rule out “parity-like” estimators that still only depend on the counts at the leaves.

Mathematics of Deep Learning: Lecture 5 – Random Sparse Nets etc.

Transcribed by Jiaoyang Huang (edited by Asad Lodhia, Elchanan Mossel and Matthew Brennan)

For today’s lecture we will present two papers. The first one is Provable Bounds for Learning Some Deep Representations by Arora, Bhaskara, Ge and Ma. They proved that random sparse neural networks are learnable. The second paper is Train faster, generalize better: Stability of stochastic gradient descent by Hardt, Recht and Singer. Their main results state that any model trained with stochastic gradient method in a reasonable amount of time attains small generalization error.

Provable Bounds for Learning Some Deep Representations

The sparse random network model has \(\ell\) layers of vectors of discrete variables \(h^{(\ell)}, h^{(\ell-1)},\cdots, h^{(1)}\) (where \(h^{(\ell)}\) is the top layer) and an observed layer \(h^{(0)}\) at the bottom. Each layer has \(n\) nodes.
\begin{align}
\mathrm{Depth}=\ell, \qquad \mathrm{Width}=n.
\end{align}
The activation function is the sgn function, i.e.
\begin{align}
\sigma(x)=\left\{
\begin{array}{cc}
1 & x>0,\\
0 & x=0,\\
-1 & x<0. \end{array} \right. \end{align} The network between two layers is a random graph with average degree \(d=n^{\gamma}\) with \(0<\gamma\leq 0.2\). More precisely, each node at level \(k\) is connected to each node at level \(k-1\) with probability \(n^{\gamma-1}\) independently. The edge weights are random in \(\{\pm 1\}\) with probability \(1/2\) each. For any vertex \(i\) in the \(k\)-th layer, and vertex \(j\) in the \((k-1)\)-th layer, the weight \(W_{ji}^{k-1}\in \{-1,0,1\}\) is such that \begin{align} W_{ji}^{k-1} =\left\{ \begin{array}{cc} 0 & \text{there is no edge between \(i,j\).}\\ -1 & \text{there is an edge with weight \(-1\) between \(i,j\).}\\ 1 & \text{there is an edge with weight \(1\) between \(i,j\).} \end{array} \right. \end{align} We remark that \(W^{k-1}\) is the weighted adjacency matrix between \((k-1)\)-th and \(k\)-th layers. The samples are generated from top to bottom. Given \(h^{(k)}\), the vector in the \(k\)-th layer, the vector in the \((k-1)\)-th layer is \(h^{(k-1)}=\mathrm{sgn}(W^{k-1}h^{(k)})\), \begin{align} h^{(k-1)}_j=\mathrm{sgn}\left(\sum_{i=1}^n W^{k-1}_{ji}h^{(k)}_i\right),\quad j=1,2,\cdots, n. \end{align} We sample the top layer \(h^{(\ell)}\), where the set of nodes that are \(1\) is picked uniformly among all sets of size \(\rho_\ell n\). \(\rho_\ell\) is called the density of the top layer, and we assume that \(\rho_\ell (d/2)^\ell=O(1)\). Since averagely, each node has \(d\) children nodes, the density at level \(k\) is roughly of size \(\rho_\ell d^{\ell-k}\ll1\), and \(h^{(k)}\) is also a sparse vector. The main theorem we will prove is the following:

When degree \(d=n^{\gamma}\), for \(0<\gamma\leq 0.2\), with density \(\rho_\ell (d/2)^{\ell}=O(1)\), the network model can be learned with high probability using \(O(\log n/\rho_\ell^2)\) samples and \(O(n^2\ell)\) time.
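Before turning to the learning algorithm, here is a minimal sampler for this generative model (our own NumPy sketch; the function names and parameter values are ours, and \(n\) is kept small so the snippet runs quickly).

```python
import numpy as np
rng = np.random.default_rng(4)

def sample_network(n, ell, gamma):
    """ell random sparse +/-1 weight matrices; entry (j, i) connects node i of the upper
    layer to node j of the layer below, present with probability n**(gamma - 1)."""
    p = n ** (gamma - 1)                     # average degree d = n**gamma
    return [(rng.random((n, n)) < p) * rng.choice([-1, 1], size=(n, n)) for _ in range(ell)]

def sample_observation(weights, n, rho_top):
    """Pick a sparse top layer and propagate down with h^{(k-1)} = sgn(W^{k-1} h^{(k)})."""
    h = np.zeros(n, dtype=int)
    h[rng.choice(n, size=max(1, int(rho_top * n)), replace=False)] = 1
    layers = [h]
    for W in reversed(weights):              # weights[k] maps layer k+1 down to layer k
        h = np.sign(W @ h).astype(int)
        layers.append(h)
    return layers                            # layers[0] is the top layer, layers[-1] is h^{(0)}

n, ell, gamma, rho_top = 500, 3, 0.2, 0.02
layers = sample_observation(sample_network(n, ell, gamma), n, rho_top)
print("number of non-zero entries per layer (top to bottom):",
      [int(np.count_nonzero(h)) for h in layers])
```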

Thanks to the sparsity, the net should approximately preserve information, and it allows one to easily go back and forth between the representations in two adjacent layers. In the following we first focus on learning a single-layer network. It has a top layer with vector \(h^{(1)}\) and a bottom layer with vector \(h^{(0)}\). We assume that \(h^{(1)}\) is sampled among sparse vectors, and then \(h^{(0)}\) is generated using \(h^{(1)}\). To learn a single-layer network, we first need to recover the network structure. Then, for each sample, we need to recover \(h^{(1)}\) from \(h^{(0)}\).

Given a single-layer network (with known graph structure), the property that we can recover \(h^{(1)}\) from \(h^{(0)}\) is characterized by the following definition:

Fix an activation function. An autoencoder consists of a decoding function \(D(h)=s(Wh+b)\) and an encoding function \(E(h)=s(W'h+b')\) where \(W,W'\) are linear transformations, \(b,b'\) are fixed vectors and \(s\) is an activation function, which acts on each coordinate. The autoencoder is said to be denoising if \(E(D(h)\oplus \xi)=h\) with high probability with respect to the noise \(\xi\) and the randomness of \(h\), where \(D(h)\oplus\xi\) means that \(D(h)\) is corrupted by the noise \(\xi\). The autoencoder is called weight tying if \(W'=W^T\).

We will show that a single-layer random network is a denoising autoencoder if the input layer and noise are both sparse.

A one layer network with density \( \rho\) satisfying \(\rho d<0.1\) is a denoising autoencoder with high probability, with respect to noise that flips every output bit independently with probability \(<0.1\). Furthermore this autoencoder is weight tying.
We denote \(W\) the weighted adjacency matrix between the top layer and the bottom layer. By definition, the one layer network implements the encoder \(E(y)=\textrm{sgn}(W^Ty)\). We will show that the one layer network is an autoencoder with decoder \(D(h)=\textrm{sgn}(Wh)\) with high probability. Note that this will also imply it is weight tying.

Let \(S\) be the support of the top layer \(h^{(1)}\), and \(S'\) be the support of the bottom layer \(h^{(0)}\). The key property of the sparse network is that most neighbors of an activated node are unique to it. More precisely, fix \(i\) in the top layer, and \(j\in {\cal N}(i)\) in the bottom layer, where \({\cal N}(i)\) is the set of neighbor vertices of \(i\), i.e. \(W_{ji}\neq 0\). Thanks to the sparsity assumption, with high probability, \(j\) has no non-zero neighbor other than \(i\), i.e.
\begin{align}
{\mathbb P}(|{\cal N}(j)\cap S\setminus \{i\}|\geq 1)\leq \rho d<0.1.
\end{align}
If \({\cal N}(j)\cap S\setminus \{i\}=\emptyset\) and \(h_j^{(0)}\) was not flipped by the noise, then \(h_{i}^{(1)}=W_{ji}h_j^{(0)}\). Moreover, by a union bound,
\begin{align}
{\mathbb P}\left({\cal N}(j)\cap S\setminus \{i\}\neq\emptyset \text{ or } h_j^{(0)} \text{ was flipped by the noise}\right)< 0.2.
\end{align}
Motivated by these observations, we can recover \(h^{(1)}\) using the following quantity,
\begin{align}
I_{i}=\sum_{j\in {\cal N}(i)}W_{ji}h_j^{(0)}.
\end{align}
If \(|I_i|\geq d/2\), then we set \(h_i^{(1)}=\mathrm{sgn}(I_i)\). Otherwise, if \(|I_i| < d/2 \), we set \(h_i^{(1)}=0\).
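Here is a minimal simulation of this recovery rule (ours; the parameters are illustrative and chosen so that \(\rho d < 0.1\), with a 5% chance of flipping the sign of each non-zero output bit).

```python
import numpy as np
rng = np.random.default_rng(5)

n, d = 2000, 30
rho, p_flip = 0.003, 0.05                         # rho * d = 0.09 < 0.1

# random sparse +/-1 bipartite graph between layer 1 (top) and layer 0 (bottom)
W = (rng.random((n, n)) < d / n) * rng.choice([-1, 1], size=(n, n))

h1 = np.zeros(n, dtype=int)
h1[rng.choice(n, size=int(rho * n), replace=False)] = 1
h0 = np.sign(W @ h1).astype(int)                  # decode: D(h) = sgn(W h)

flip = rng.random(n) < p_flip                     # corrupt the output by flipping signs
h0_noisy = np.where(flip, -h0, h0)

I = W.T @ h0_noisy                                # I_i = sum_{j in N(i)} W_ji * h0_j
h1_hat = np.where(np.abs(I) >= d / 2, np.sign(I), 0).astype(int)
print("coordinates recovered incorrectly:", int(np.sum(h1_hat != h1)), "out of", n)
```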

In the following we focus on learning the structure of a single-layer network. We recall that \(h^{(1)}\) is sampled among sparse vectors (with density \(\rho\)), and then \(h^{(0)}\) is generated using \(h^{(1)}\). The graph structure can be recovered using \(\mathrm{poly}(n)\) samples.

Given \(\mathrm{poly}(n)\) samples, one can recover the weighted adjacency matrix \(W\) (between layers \(0\) and \(1\)).
We define a \(\sim\) relation among vertices in the bottom layer: call two nodes \(i_1,i_2\) at the bottom layer related, denoted by \(i_1\sim i_2\), if they are connected to the same node at the top layer. We can identify related nodes by their correlations. If \(i_1\sim i_2\), then they have common neighbors. With probability \(\rho\), one of their common neighbors is non-zero, so
\begin{align}
\mathbb E\left[h_{i_1}^{(0)}h_{i_2}^{(0)}\right]\approx \rho.
\end{align}
If \(i_1\not\sim i_2\), then they do not have common neighbors. With good probability, at least one of them is zero
\begin{align}
\mathbb E\left[h_{i_1}^{(0)}h_{i_2}^{(0)}\right]\approx 0.
\end{align}
Once we have recovered the \(\sim\) relation, recovering the graph structure is straightforward. In fact we can name vertices in the top layer by largest clusters in the bottom layer. More precisely, each vertex in the top layer corresponds to a largest cluster \(A\) in the bottom layer such that \(|A|\geq 0.1d\) and for any \(i,j\in A\), \(i\sim j\). This approach can be strengthened to tolerate some errors by identifying clusters that differ in fewer than \(0.1d\) nodes, i.e. we identify \(A\) and \(A'\) if \(|A\Delta A'|\leq 0.1 d\).
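The correlation step is easy to simulate. The sketch below (ours; parameters are illustrative, the top layer is approximated by independent Bernoulli coordinates, and since the weights are \(\pm 1\) we threshold the absolute correlation) estimates \(\mathbb E[h_{i_1}^{(0)}h_{i_2}^{(0)}]\) from samples and compares the declared \(\sim\) relation with the true one; pairs whose common neighbours have cancelling signs can be missed, so a handful of disagreements is expected.

```python
import numpy as np
rng = np.random.default_rng(6)

n, d, rho, m = 500, 5, 0.01, 8000                 # m samples; rho * d = 0.05
W = (rng.random((n, n)) < d / n) * rng.choice([-1, 1], size=(n, n))

H1 = (rng.random((m, n)) < rho).astype(float)     # sparse top layers, one sample per row
H0 = np.sign(H1 @ W.T)                            # corresponding bottom layers

corr = H0.T @ H0 / m                              # empirical E[h_i^{(0)} h_j^{(0)}]
related_hat = np.abs(corr) > rho / 2              # declare i ~ j when |correlation| is about rho
np.fill_diagonal(related_hat, False)

share = (np.abs(W) @ np.abs(W).T) > 0             # ground truth: i ~ j iff a common top neighbor
np.fill_diagonal(share, False)
print("disagreements:", int(np.sum(related_hat != share)),
      "out of", n * (n - 1), "ordered pairs")
```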

The high level network inference algorithm in the paper learns the network layer by layer, starting from the bottom. For the network between the \(i\)-th and \((i+1)\)-th layers, the algorithm first recovers the network structure between levels \(i\) and \(i+1\). Then it recovers \(h^{(i+1)}\) from \(h^{(i)}\) for each sample, as above.

There is a technical issue. In the original model, we sample \(h^{(\ell)}\) uniformly among all the sparse vectors, then sequentially generate \(h^{(\ell-1)}, h^{(\ell-2)}, \cdots, h^{(0)}\). The bits of \(h^{(i+1)}\) are not independent! We cannot directly use the Lemma above to recover the network structure between levels \(i\) and \(i+1\). However, this problem can be resolved by showing that the bits of \(h^{(i+1)}\) cannot be too correlated unless they have a common ancestor. Moreover, the correlation decays exponentially, i.e.
\begin{align}
\text{correlation via a common grandparent} \leq \frac{1}{d} \times \text{correlation via a common parent.}
\end{align}

Train Faster, Generalize Better: Stability of Stochastic Gradient Descent

Let’s consider the following general setting of supervised learning. There is an unknown distribution \(\cal D\). We receive a sample \(S=(z_1,z_2,\cdots, z_n)\) of \(n\) i.i.d. examples drawn from \(\cal D\). Let \(f(w, z)\) denote a loss function, where \(w\) specifies the model and \(z\) denotes an example. Our goal is to minimize the population risk:
\begin{align}
R[w]:= {\mathbb E}_{z\sim \cal D}[f(w,z)],
\end{align}
over all possible models \(w\). Since the distribution \(\cal D\) is unknown, we cannot measure the quantity \(R[w]\) directly, we instead minimize the empirical risk:
\begin{align}
R_S[w]:=\frac{1}{n}\sum_{i=1}^n f(w,z_i).
\end{align}
The generalization error of a model \(w\) is the difference
\begin{align}
R_S[w]-R[w].
\end{align}
If \(w=A(S)\) is chosen as a function of the data by a possibly random algorithm \(A\), we need to estimate the expected generalization error
\begin{align}
\varepsilon_{gen}:=\mathbb E_{S,A}[R_S[A(S)]-R[A(S)]].
\end{align}

In order to bound the generalization error, we recall the following definition of uniform stability,

The algorithm \(A\) is \(\varepsilon\)-uniformly stable if for all data sets \(S, S'\) that differ in at most one example:
\begin{align}
\sup_{z}\mathbb E_{A}[f(A(S),z)-f(A(S’),z)]\leq \varepsilon.
\end{align}
We define \(\varepsilon_{Stab}(A,n)\) the smallest such \(\varepsilon\).
The generalization error can be controlled by \(\varepsilon_{Stab}(A,n)\):
\begin{align}
|\mathbb E_{S,A}[R_S[A(S)]-R[A(S)]]|\leq \varepsilon_{Stab}(A,n).
\end{align}
Denote by \(S=(z_1,z_2,\cdots, z_n)\) and \(S’=(z_1′,z_2′,\cdots,z_n’)\) two independent random samples and let \(S^{(i)}=(z_1,\cdots, z_{i-1}, z_i’,z_{i+1},\cdots,z_n)\). With these notations we have
\begin{align}\begin{split}
\mathbb E_{S}\mathbb E_{A}[R_S[A(S)]]
&=\mathbb E_{S}\mathbb E_{A}\left[\frac{1}{n}\sum_{i=1}^n f(A(S),z_i)\right]\\
&=\mathbb E_{S}\mathbb E_{S’}\mathbb E_{A}\left[\frac{1}{n}\sum_{i=1}^n f(A(S^{(i)}),z_i’)\right]\\
&=\mathbb E_{S}\mathbb E_{S’}\mathbb E_{A}\left[\frac{1}{n}\sum_{i=1}^n f(A(S),z_i’)\right]+\delta\\
&=\mathbb E_{S}\mathbb E_{A}\left[R[A(S)]\right]+\delta,
\end{split}\end{align}
where
\begin{align}
|\delta|&=\left|\mathbb E_{S}\mathbb E_{S’}\mathbb E_{A}\left[\frac{1}{n}\sum_{i=1}^n \left(f(A(S^{(i)}),z_i’)-f(A(S),z_i’)\right)\right]\right| \\
&\leq \varepsilon_{Stab}(A,n).
\end{align}
This finishes the proof.

In the following we show that for stochastic gradient iterative algorithms, we can control their uniform stability, and obtain some upper bound on the generalization error.

Given the loss function \(f(w,z)\), the stochastic gradient update with learning rate \(\alpha_t>0\) is given by
\begin{align}
w_{t+1}=w_t-\alpha_t \nabla_w f(w_t, z_{i_t})
\end{align}
where the indices \(i_t\) are either picked uniformly randomly in \(\{1,2,\cdots,n\}\), or picked according to a random permutation.
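For concreteness, a bare-bones implementation of this update rule (ours; the quadratic-loss example at the end is only an illustration):

```python
import numpy as np
rng = np.random.default_rng(7)

def sgd(grad_f, w0, data, alphas):
    """Stochastic gradient method: w_{t+1} = w_t - alpha_t * grad_f(w_t, z_{i_t})."""
    w = np.array(w0, dtype=float)
    for alpha in alphas:
        z = data[rng.integers(len(data))]         # i_t picked uniformly at random
        w = w - alpha * grad_f(w, z)
    return w

# toy illustration: least-squares loss f(w, (x, y)) = 0.5 * (w @ x - y)^2
X = rng.standard_normal((100, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 3.0]) + 0.1 * rng.standard_normal(100)
grad = lambda w, z: (w @ z[0] - z[1]) * z[0]
w_hat = sgd(grad, np.zeros(5), list(zip(X, y)), alphas=[0.01] * 2000)
print(w_hat)
```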

In the rest of the lecture, we analyze stochastic gradient descent for convex loss minimization. Let us first discuss a little bit of convexity.

If \(f\) is convex and \(\nabla f\) is \(\beta\)-Lipschitz then
\begin{align}\label{e:coercive}
\langle \nabla f(v)-\nabla f(w), v-w\rangle \geq \frac{1}{\beta}\|\nabla f(v)-\nabla f(w)\|_2^2.
\end{align}
First we prove that
\begin{align}\label{e:sconvex1}
f(y)\geq f(x)+\nabla f(x)^T(y-x)+\frac{1}{2\beta}\|\nabla f(y)-\nabla f(x)\|^2_2.
\end{align}
Then by symmetry, we have
\begin{align}\label{e:sconvex2}
f(x)\geq f(y)+\nabla f(y)^T(x-y)+\frac{1}{2\beta}\|\nabla f(y)-\nabla f(x)\|^2_2.
\end{align}
Our claim \eqref{e:coercive} follows by taking sum of \eqref{e:sconvex1} and \eqref{e:sconvex2}.

For the proof of \eqref{e:sconvex1}, we introduce a new function
\begin{align}
g(z)=f(z)-\nabla f(x)^T z.
\end{align}
Then \(g(z)\) is convex and \(\nabla g\) is \(\beta\)-Lipschitz. Moreover, thanks to the convexity of \(f\), i.e.
\begin{align}
f(z)\geq f(x)+\nabla f(x)^T(z-x),
\end{align}
\(g\) achieves its minimum at \(z=x\).

Since \(\nabla g\) is \(\beta\)-Lipschitz, by the Taylor expansion, for any \(z\),
\begin{align}
g(z)\leq g(y)+\nabla g(y)^T(z-y)+\frac{\beta}{2}\|z-y\|_2^2.
\end{align}
By taking the minimum on both sides,
\begin{equation}
\begin{split}
g(x)
&=\min_z\{ g(z)\}\\
&\leq \min_z\left\{g(y)+\nabla g(y)^T(z-y)+\frac{\beta}{2}\|z-y\|_2^2\right\}\\
&=g(y)-\frac{1}{2\beta}\|\nabla g(y)\|_2^2,
\end{split}
\end{equation}
which is \eqref{e:sconvex1}.

We prove the stability bound for convex loss minimization via stochastic gradient method.

Assume that for all \(z\), \(f(\cdot, z)\) is convex and \(L\)-Lipschitz, and \(\nabla f\) is \(\beta\)-Lipschitz. Suppose that we run the stochastic gradient method for \(T\) steps with step sizes \(\alpha_t\leq 2/\beta\), then
\begin{align}
\varepsilon_{Stab}\leq \frac{2L^2}{n}\sum_{t=1}^T\alpha_t.
\end{align}
Let \(S\) and \(S’\) be two samples of size \(n\) differing in only a single example. Let \(\{w_t\}\) and \(\{w_t’\}\) be two runs of the algorithm with data sets \(S\) and \(S’\), and \(\delta_t=\|w_t-w_t’\|_2\).

We couple the two algorithms with data sets \(S\) and \(S’\). At step \(t\), there are two cases: with probability \(1-1/n\), the algorithms choose the same data example; with probability \(1/n\), the algorithms choose different data examples.

If the two algorithms choose the same data example \(z\) at step \(t\), then
\begin{equation}
\begin{split}
\delta_{t+1}^2
=&\|w_t-w_t' -\alpha_t(\nabla f(w_t, z)-\nabla f(w'_t,z))\|^2_2\\
=&\|w_t-w_t'\|^2_2 +\alpha_t^2\|\nabla f(w_t, z)-\nabla f(w'_t,z)\|^2_2\\
&-2\alpha_t\langle w_t-w_t', \nabla f(w_t, z)-\nabla f(w'_t,z)\rangle\\
\leq& \|w_t-w_t'\|^2_2=\delta_t^2,
\end{split}
\end{equation}
provided that \(\alpha_t\leq 2/\beta\), where we used \eqref{e:coercive}.

If the two algorithms choose different data examples \(z\) and \(z’\), then
\begin{equation}
\begin{split}
\delta_{t+1}&=\|w_t-w_t' -\alpha_t(\nabla f(w_t, z)-\nabla f(w'_t,z'))\|_2\\
&\leq \|w_t-w_t'\|_2+ \alpha_t\left(\|\nabla f(w_t, z)\|_2+\|\nabla f(w'_t,z')\|_2\right)\\
&\leq \delta_t+2\alpha_tL.
\end{split}
\end{equation}

The first case happens with probability \(1-1/n\), and the second case happens with probability \(1/n\). On average, we have
\begin{equation}
\begin{split}
\mathbb E[\delta_{t+1}]&\leq \left(1-\frac{1}{n}\right)\mathbb E[\delta_t]+
\frac{1}{n}\mathbb E[\delta_t+2\alpha_t L]\\
&=\mathbb E [\delta_t]+\frac{2\alpha_tL}{n}.
\end{split}
\end{equation}
Overall, we have
\begin{align}
\mathbb E[\delta_{T}]\leq\frac{2L}{n}\sum_{t=1}^T\alpha_t,
\end{align}
and
\begin{align}
\mathbb E\left|f(w_T,z)-f(w_T',z)\right|\leq L\, \mathbb E\|w_T-w_T'\|_2\leq \frac{2L^2}{n}\sum_{t=1}^T\alpha_t.
\end{align}
This finishes the proof.
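The coupling argument is easy to reproduce empirically. The sketch below (ours) runs the stochastic gradient method on two data sets differing in a single example, using a convex, smooth, Lipschitz loss (logistic loss), and compares the final parameter distance to the quantity \(\frac{2L}{n}\sum_t \alpha_t\) from the proof; all parameter choices are arbitrary.

```python
import numpy as np
rng = np.random.default_rng(8)

def grad(w, x, y):
    """Gradient of the logistic loss f(w, (x, y)) = log(1 + exp(-y * w @ x))."""
    return -y * x / (1.0 + np.exp(y * (w @ x)))

n, dim, T, alpha = 200, 5, 2000, 0.1
X = rng.standard_normal((n, dim)); Y = rng.choice([-1.0, 1.0], size=n)
X2, Y2 = X.copy(), Y.copy()
X2[0], Y2[0] = rng.standard_normal(dim), 1.0        # S and S' differ only in example 0

w, w2 = np.zeros(dim), np.zeros(dim)
for _ in range(T):
    i = rng.integers(n)                             # the coupling: both runs use the same index i_t
    w, w2 = w - alpha * grad(w, X[i], Y[i]), w2 - alpha * grad(w2, X2[i], Y2[i])

L = max(np.linalg.norm(X, axis=1).max(), np.linalg.norm(X2[0]))  # f(., z) is ||x||-Lipschitz here
print("||w_T - w_T'|| =", np.linalg.norm(w - w2))
print("(2L/n) * sum_t alpha_t =", 2 * L * alpha * T / n)
```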

Mathematics of Deep Learning: Lecture 4 – PAC Learning and Deep Nets

Transcribed by Vishesh Jain (edited by Asad Lodhia and Elchanan Mossel)

PAC Learning

We begin by discussing (some variants of) the PAC (Probably Approximately Correct) learning model introduced by Leslie Valiant. Throughout this section, we will deal with a hypothesis class or concept class, denoted by \(\mathcal{C}\); this is a space of functions \(\mathcal{X}\rightarrow\mathcal{Y}\), where \(\mathcal{X}=\{0,1\}^{n}\) for some \(n\in\mathbb{N}\) and \(\mathcal{Y}\) is some finite set.

As a concrete example, we could consider formalizing the task of classifying whether a \(16\times16\) bitmap depicts a cat, a dog, or neither, in which case we could take \(n\) to be \(256\), \(\mathcal{Y}\) to be \(\{c,d,n\}\) (where \(c\) denotes “Cat”, \(d\) denotes “Dog” and \(n\) denotes “Neither”), and the concept class \(\mathcal{C}\) to consist of a set of “classifiers” i.e. a subset of functions \(\{0,1\}^{256}\rightarrow\{c,d,n\}\).

The proper PAC learning task is the following: given \(\mathcal{X},\mathcal{Y},\mathcal{C}\), and a fixed but arbitrary distribution \(\mathcal{D}\) on \(\mathcal{X}\), suppose we are given samples \(\{(x_{i},h(x_{i}))\}_{i=1}^{m}\), where \(h\in\mathcal{C}\) and \(x_{i}\) are i.i.d. samples generated according to the distribution \(\mathcal{D}\). Then, we say that an algorithm \(A\), which has access to the samples \(\{(x_{i},h(x_{i})\}_{i=1}^{m},\mathcal{X}, \mathcal{Y}\) and \( \mathcal{C}\) but no access to \(\mathcal{D}\), is a proper PAC learner for the hypothesis \(h\) under the distribution \(\mathcal{D}\) if for all \(\epsilon,\delta>0\), the algorithm \(A\) (given \(\epsilon,\delta\) as additional inputs) runs in time \(\text{poly}(n,\frac{1}{\epsilon},\frac{1}{\delta})\) and requires \(\text{poly}(n,\frac{1}{\epsilon},\frac{1}{\delta})\) many samples to return a classifier \(\tilde{h}\in\mathcal{C}\) such that with probability at least \(1-\delta\) over the examples
\[
P_{x\sim\mathcal{D}}[\tilde{h}(x)\neq h(x)]\leq\epsilon
\]
Finally, we say that the algorithm \(A\) is a proper PAC learner for the concept class \(\mathcal{C}\) if for every \(h\in\mathcal{C}\) and for every distribution \(\mathcal{D}\) on \(\mathcal{X}\), the above holds. Note that the polynomials in the above definition are not allowed to depend on the distribution \(\mathcal{D}\) or the “correct” hypothesis \(h\).
A slight weakening of the above notion is that of an improper PAC learner, under which the \(\tilde{h}\) returned by the algorithm \(A\) need not belong to the concept class \(\mathcal{C}\) (but all of the other properties of a PAC learner must still hold).

Before proceeding further, we make a few remarks on the assumptions underlying the PAC learning model. For many applications, the assumption that our samples are i.i.d. samples from a common distribution \(\mathcal{D}\) can be very unrealistic. For instance, if our sample consists of images generated by taking snapshots of a video at a fixed frequency, then we expect to have some correlation between samples. On the other hand, the requirement that a PAC learner should perform well under any distribution on the domain space can be quite restrictive, and indeed, it is the case that proofs of theorems about non-PAC-learnability sometimes involve carefully constructing a “bad” distribution for a given algorithm. We also emphasize that the definition of PAC learnability that we have presented here seeks efficiency in terms of both time complexity as well as sample complexity. This is in contrast to statistical learning theory, where the focus is typically only on sample complexity.

Another notion, central to the definition of PAC learnability as presented above, is that of “realizability” i.e. the assumption that the data is generated by a function in our hypothesis class \(\mathcal{C}\). This can obviously fail to hold in many cases of interest, for instance, when we choose a not-too-big hypothesis class \(\mathcal{C}\) to avoid over-fitting. To deal with such situations we introduce the stronger notion of agnostic PAC learning.

The setup is the same as in proper PAC learning, except that instead of assuming that the data is of the form \((x_i,h(x_i))\) for some function \(h\in\mathcal{C}\), we assume that the pairs \((x_{i},y_{i})\) are generated by some distribution \(\mathcal{D}\) on \(\mathcal{X}\times\mathcal{Y}\). Note, in particular, that since the distribution is over \(\mathcal{X}\times\mathcal{Y}\) and not just on \(\mathcal{X}\), we are also allowing for situations when \((x,y)\) and \((x,y’)\) can both have positive mass for \(y\neq y’\) so the data cannot be generated by any function from \(\mathcal{X}\rightarrow\mathcal{Y}\). We define the “quality of optimal approximation of \(\mathcal{D}\) by \(\mathcal{C}\)” by
\[
E(\mathcal{D},\mathcal{C})\colon=\inf_{h\in\mathcal{C}}P_{(x,y)\sim\mathcal{D}}[h(x)\neq y],
\]
and we say that an algorithm \(A\) is an agnostic PAC learner for a concept class \(\mathcal{C}\) if for all \(\epsilon,\delta>0\), the algorithm consumes \(\text{poly}(n,\frac{1}{\epsilon},\frac{1}{\delta})\) time and samples and, with probability at least \(1-\delta\) over the samples, returns \(\tilde{h}\in\mathcal{C}\) such that
\[
P_{(x,y)\sim\mathcal{D}}\left[\tilde{h}(x)\neq y\right]\leq E(\mathcal{D},\mathcal{C})+\epsilon
\]
As before, we can relax agnostic PAC learning to improper agnostic PAC learning. Another relaxation is the notion of \(\alpha\)-agnostic PAC learning for some \(\alpha\geq1\), where everything is exactly
as above, except we only require that
\[
P_{(x,y)\sim\mathcal{D}}\left[\tilde{h}(x)\neq y\right]\leq\alpha E(\mathcal{D},\mathcal{C})+\epsilon
\]

The general theory of PAC learning goes through with arbitrary domains \(\mathcal{X}\) and arbitrary codomains \(\mathcal{Y}\). For example, one could talk about the PAC learnability of a concept class \(\mathcal{C}\) of functions \(\mathbb{R}^{n}\rightarrow\mathbb{R}\). One of the things that changes when going from the discrete to the non-discrete setting is that the “loss” \(P_{(x,y)\sim\mathcal{D}}[h(x)\neq y]\) for \(h\in\mathcal{C}\) becomes too stringent, and often not useful. Instead, it is much more common to replace it with \(\mathbb{E}_{(x,y)\sim\mathcal{D}}[L(h(x),y)]\), where \(L\colon\mathcal{Y}\times\mathcal{Y}\rightarrow\mathbb{R}\) is a choice of “loss function”. For instance, when \(\mathcal{Y}=\mathbb{R}^{k}\), common choices of \(L\) include the
\(L^{1}\) loss, \(L(y_{1},y_{2})=||y_{1}-y_{2}||_{L^{1}}\) and the \(L^{2}\) loss, \(L(y_{1},y_{2})=||y_{1}-y_{2}||_{L^{2}}^{2}\).

We conclude this section with an example-sketch of how basic results in PAC learning are proved. Let \(\mathcal{X}=\mathbb{S}^{n-1}\subset\mathbb{R}^{n}\) be the
unit sphere in \(\mathbb{R}^{n}\) and \(\mathcal{Y}=[0,1]\). Consider the concept class \(\mathcal{C}\) consisting of functions \(\mathbb{S}^{n-1}\rightarrow[0,1]\) of the form
\[
h_{w}(x)=\sum_{i=1}^{n}w_{i}x_{i}\qquad\text{with}\qquad w\in\mathbb{S}^{n-1}\subset\mathbb{R}^{n}.
\]
Finally, the loss function \(L\) is the \(L^{2}\) loss (or in this case, the quadratic loss) given by \(L(y_{1},y_{2})=(y_{1}-y_{2})^{2}\). To show that \(\mathcal{C}\) is agnostically PAC learnable, it suffices to exhibit an agnostic PAC learning algorithm \(A\) for \(\mathcal{C}\). We claim that the empirical risk minimizer, given by
\[
\tilde{h}=h_{w^{*}}=\sum_{i=1}^{n}w_{i}^{*}x_{i}
\]
for \(w^{*}=\text{argmin}_{w\in\mathbb{S}^{n-1}}\sum_{i=1}^{m}L(h_{w}(x_{i}),y_{i})\) (where \(m\) is the number of samples) is such a learner. There are two steps involved in showing this: first, we note that \(w^{*}=\arg\min_{w}\sum_{i=1}^{m}(h_{w}(x_{i})-y_{i})^{2}\) has a well-known formula which can be computed in \(\text{poly}(m)\) time; second, assuming that \(\{(x_{i},y_{i})\}_{i=1}^{m}\) are i.i.d. samples from some distribution \(\mathcal{D}\) on \(\mathbb{S}^{n-1}\times[0,1]\), it follows from a standard concentration of measure argument that for \(m=\text{poly}(n,\frac{1}{\epsilon},\frac{1}{\delta})\) samples, the empirical loss \(\frac{1}{m}\sum_{i=1}^{m}(\tilde{h}(x_{i})-y_{i})^{2}\) will be \(\frac{\epsilon}{4}\) close to the true loss \(\mathbb{E}_{(x,y)\sim\mathcal{D}}[L(\tilde{h}(x),y)]\) with probability at least \(1-\delta\) over the samples, and one can use this, along with the compactness of \(\mathbb{S}^{n-1}\times[0,1]\) to show that \(\mathbb{E}_{(x,y)\sim\mathcal{D}}[L(\tilde{h}(x),y)]\leq\inf_{h\in\mathcal{C}}\mathbb{E}_{(x,y)\sim\mathcal{D}}[L(h(x),y)]+\epsilon\).
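The second step, the concentration of the empirical loss around the population loss, is easy to see in a simulation. The sketch below (ours; the data distribution is an arbitrary illustrative choice) minimizes the quadratic loss over all of \(\mathbb{R}^{n}\) for simplicity, dropping the sphere constraint, and checks that the loss on the training sample and on a fresh sample agree closely once \(m\) is large.

```python
import numpy as np
rng = np.random.default_rng(9)

n, m = 10, 5000
w0 = rng.standard_normal(n); w0 /= np.linalg.norm(w0)       # an arbitrary "signal" direction

def draw(m):
    """i.i.d. samples from a distribution on S^{n-1} x [0, 1] (an illustrative choice)."""
    X = rng.standard_normal((m, n)); X /= np.linalg.norm(X, axis=1, keepdims=True)
    y = np.clip(0.5 + 0.5 * (X @ w0) + 0.05 * rng.standard_normal(m), 0.0, 1.0)
    return X, y

X, y = draw(m)
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)               # empirical risk minimizer (quadratic loss)
X_fresh, y_fresh = draw(m)
print("empirical loss  :", np.mean((X @ w_hat - y) ** 2))
print("fresh-sample loss:", np.mean((X_fresh @ w_hat - y_fresh) ** 2))
```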

PAC Learnable Net Models

In this section, we intend to give a flavour of PAC learning results for feedforward neural networks. This is a very young and active area of research, and most theoretical results are negative. However, in this section, our focus will be on positive results, following Livni, Shalev-Shwartz and Shamir. Using the same notation as in the previous section, we work in the setting where \(\mathcal{X}=\{0,1\}^{d}\) and \(\mathcal{Y}=\{\pm1\}\). Our concept classes will be feedforward neural networks depending on a few parameters, and we will denote them by \(\mathcal{C}_{t,n,\sigma,L}\): \(t\) denotes the depth of the network, \(n\) denotes its size, \(\sigma\) denotes the activation function, and \(L\) denotes a constant such that for each neuron in the network, we have \(||w||_{1}+|b|\leq L\), where \(w\) denotes the vector of weights input to the neuron, and \(b\) denotes the bias added by the neuron. Note that the function computed by such a network is real valued—we convert it to a \(\{\pm1\}\) valued function by taking its sign. Moreover, in the learning theorems below, we will make both the “realisability” assumption as well as the “margin” assumption i.e. we assume that our learner receives samples of the form \((x,\text{sign}(h^{*}(x)))\), where \(h^{*}\) is some (real-valued) function belonging to the class \(\mathcal{C}\), and further, \(|h^{*}(x)|\geq1\) for all \(x\in\mathcal{X}\). Finally, we will consider improper learners.

[LSS, ’14] Constant-depth polynomial networks (i.e. the activation function is \(\sigma(y)=y^{2}\)) are learnable in \(\text{poly}(d)\) time under the above assumptions.

Indeed, a polynomial network of depth \(\ell\) always computes a polynomial function in \(d\) variables of degree at most \(2^{\ell}\). We can view such polynomials as living in a linear space of dimension \(k\), where \(k\leq(d+1)^{2^{\ell}}\) denotes the number of monomials of degree at most \(2^{\ell}\) in \(d\) variables. Since \(\ell\) is constant, \(2^{\ell}\) is also constant. Hence, we are reduced to learning a linear function on \(\text{poly}(d)\) variables, which can be done in \(\text{poly}(d)\) time (for instance, when a convex loss function is used) using standard learning algorithms.
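A quick sketch (ours) of this linearization trick for a toy one-hidden-layer squared-activation network: its output is a degree-2 polynomial of the input, so ordinary linear regression over all monomials of degree at most 2 fits it exactly.

```python
import numpy as np
from itertools import combinations_with_replacement
rng = np.random.default_rng(10)

def monomial_features(X, degree):
    """All monomials of degree <= degree in the coordinates of x (the linearization)."""
    m, dim = X.shape
    feats = [np.ones(m)]
    for k in range(1, degree + 1):
        for idx in combinations_with_replacement(range(dim), k):
            feats.append(np.prod(X[:, list(idx)], axis=1))
    return np.column_stack(feats)

# a toy depth-1 "polynomial network": one hidden layer with sigma(y) = y^2, degree 2^1 = 2
dim, m = 6, 3000
X = rng.standard_normal((m, dim))
V = rng.standard_normal((3, dim))                    # hypothetical hidden-layer weights
y = np.sum((X @ V.T) ** 2, axis=1)                   # the network's output

Phi = monomial_features(X, degree=2)                 # 1 + 6 + 21 = 28 features
coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)       # plain linear regression in monomial space
print("max residual:", float(np.max(np.abs(Phi @ coef - y))))   # ~0: exactly linear in these features
```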

By an argument involving the approximation of the sigmoid function by Chebyshev polynomials, the previous proposition can be used to show the following:

[LSS, ’14] The concept class \(\mathcal{C}_{2,n,\sigma_{\text{sigmoid}},L}\) can be learned to accuracy \(\epsilon\) with probability at least \(0.99\) over the samples in time \(\text{poly}(T)\), where \(T=\frac{1}{\epsilon}O(d^{4L\ln(11L^{2}+1)})\).

We conclude this brief section by mentioning a recent learning result due to Goel, Kanade, Klivans and Thaler for feedforward neural networks with ReLU gates. In this theorem, we take \(\mathcal{X}=\mathbb{S}^{d-1}\) and \(\mathcal{Y}=[0,1]\). We mention that the theorem actually holds for a stronger notion called reliable learnability. We also note that the \(2^{O(1/\epsilon)}\) factor appearing below precludes this from being a statement about (efficient) PAC-learnability, according to the definition in this lecture.

[GKKT, ’16] Let \(\mathcal{C}=\{x\mapsto\max(0,w\cdot x)\), \(||w||_{2}\leq1\}\). Then, there is a learning algorithm \(A\) that reliably learns \(\mathcal{C}\) in time \(2^{O(1/\epsilon)}.d^{O(1)}\).

Hardness of Learning

We now discuss results related to the hardness of learning. In general, hardness results state that it is algorithmically hard to find a solution to a problem, even though it is information theoretically possible to do so. Hardness of learning results are often based on hardness assumptions coming from cryptography or computational complexity. For instance, a typical hardness result might state that if one can learn a class of functions, then one can break a cryptosystem, or one can solve some canonical algorithmic problem (which is believed to be hard). An early result in this direction is the following proposition due to Kearns and Valiant:

[KV, ’94] Consider a public-key cryptosystem for encrypting individual bits by \(n\)-bit strings. Let \(\mathcal{C}\) be a concept class that contains all the decryption functions \(d_{K}\colon\{0,1\}^{n}\rightarrow\{0,1\}\) of the cryptosystem, one for each choice of key \(K=(K_{priv},K_{pub})\). Let \(\epsilon(n)=P{}_{K,r}[d_{K}(e_{K,r}(0))\neq0\) or \(d_{K}(e_{K,r}(1))\neq1]\) be the probability of decryption error (over the choice of keys and randomization in the encryption). If \(\mathcal{C}\) is weakly PAC-learnable in time \(t(n)\) with \(t(n)\epsilon(n)=1/n^{\omega(1)}\), then there is a distinguisher between the encryptions of \(0\) and \(1\) that runs in time \(O(t(n))\).

In the above result, we are considering public-key cryptography with key \(K=(K_{priv},K_{pub})\), randomized encryption (using the key \(K\), and random bits \(r\)), and deterministic decryption (using the key \(K\)). Weak PAC-learnability relaxes the requirement that \(P_{x\sim\mathcal{D}}[\tilde{h}(x)\neq h(x)]\le \epsilon\) to \(P_{x\sim\mathcal{D}}[\tilde{h}(x)\neq h(x)]<\frac{1}{2}-\frac{1}{n^{c}}\) for some constant \(c\) i.e. the learner performs non-negligibly better than random guessing. Finally, an algorithm \(A\) is said to distinguish between the encryptions of \(0\) and \(1\) if for some universal constant \(c\) \[ |P_{K,r}[A(K_{pub},e_{K,r}(1))=1]-P_{K,r}[A(K_{pub},e_{K,r}(0))=1]|\geq\frac{1}{n^{c}} \] The intuition behind the Kearns-Valiant theorem is that, since encryptions are easy to generate, one can view a collection of encryptions as training examples for learning the decryption function. But since learning the decryption function to a non-negligible advantage amounts to breaking the cryptosystem, it follows that if the cryptosystem is secure, we cannot learn the decryption function. Our focus in this section will be a result of Klivans and Sherstov which states that PAC learning the concept class \(\mathcal{C}\) of \(2\)-layer feedforward neural networks with the threshold activation function \(\sigma(x)=\boldsymbol{1}_{x\geq0}\) is hard under a certain cryptographic assumption. More precisely, we have the following:

[KS, ’06] Assume that depth-2 polynomial-size circuits of threshold gates are PAC learnable. Then there is a polynomial-time solution to \(\tilde{O}(n^{1.5})-uSVP\).

Here, \(f(n)-uSVP\) refers to the unique shortest vector problem on an \(n\)-dimensional lattice, in which the goal is to find the shortest non-zero vector in a given lattice, provided that it is shorter by a factor of at least \(f(n)\) than any other non-parallel vector. We take \(f(n)=\tilde{O}(n^{1.5})\), an approximation factor for which this problem is believed to be hard. The theorem follows from the following result about the hardness of learning the intersection of half-spaces.

[KS, ’06] Assume that intersections of \(n^{\lambda}\) halfspaces in \(n\) dimensions are PAC learnable for some constant \(\lambda\) > 0. Then there is a polynomial-time solution to \(\tilde{O}(n^{1.5})-uSVP\).

Indeed, this result implies the previous one, since half-spaces i.e. subsets of the form \(a_{1}x_{1}+\dots+a_{n}x_{n}-\theta\geq0\) are easily implemented by threshold gates, and so are intersections. Moreover, by an argument due to Regev, the theorem continues to hold even when all the half-spaces involved are “light” i.e. \(\sum_{i=1}^{n}|a_{i}|+|\theta|\) is bounded by some polynomial in \(n\).

To prove the latter theorem, one first uses a simple linearization argument (like the one we used in the previous section) to show that if the intersections of \(n^{\lambda}\) halfspaces in \(n\)-dimensions are weakly PAC-learnable, then for any constant \(c>0\), the intersections of \(n^{c}\) degree-2 polynomial threshold functions (PTFs) i.e. functions of the form \(\boldsymbol{1}_{p(x)\geq0}\) where \(p(x)\) is a polynomial of degree at most \(2\), are also weakly PAC-learnable. Next, we show that the class of such intersections of degree-2 PTFs contains the decryption function of a cryptosystem whose security is based on the hardness of \(\tilde{O}(n^{1.5})-uSVP\). By the Kearns-Valiant result, it follows that if this class is weakly PAC-learnable, the cryptosystem is insecure, and hence \(\tilde{O}(n^{1.5})-uSVP\) is solvable in polynomial time.

We now provide a description of this cryptosystem due to Regev. We begin by taking \(N=2^{8n^{2}}\) and \(m=cn^{2}\) where \(c\) is a universal constant. The private key consists of a real number \(H\in[\sqrt{N},2\sqrt{N}]\), and the public key consists of a vector \((A_{1},\dots,A_{m},i_{0})\) where \(i_{0}\in\{1,\dots,m\}\), and each \(A_{i}\in\{0,\dots,N-1\}\). Moreover, the \(A_{i}’s\) are “almost” integer multiples of \(N/H\) and \(A_{i_{0}}\) is “almost” an odd integer multiple of \(N/H\). To encrypt \(0\), we pick a random subset \(S\subset[m]\) and output \(\sum_{i\in S}A_{i}\mod N\) ; to encrypt 1, we pick a random subset \(S\subset[m]\) and output \(\lfloor\frac{A_{i_{0}}}{2}\rfloor+\sum_{i\in S}A_{i}\mod N\). The idea is that \(0\) will be encrypted to something which is close to an integer multiple of \(N/H\), whereas \(1\) will be encrypted to something which is close to an odd half-integer multiple of \(N/H\). Accordingly, on receiving \(W\in\{0,\dots,N-1\}\), the decryption system outputs \(0\) if \(\{WH/N\}<\frac{1}{4}\) and outputs \(1\) otherwise. Here, \(\{a\}\) for \(a\in\mathbb{R}\) denotes the minimum distance between \(a\) and an integer, where the distance between two real numbers is the absolute value of their difference. Moreover, by a standard argument, we may use the decryption system \(\{AW\}\), where \(A\) is a representation of \(H/N\) to within \(\text{poly}(n)\) fractional bits. The only thing that remains to be shown is that the decryption system for this cryptosystem can be implemented using the intersection of \(n^{\beta}\) degree-2 PTFs for some constant \(\beta>0\). We have already remarked that we can use a decryption function of the form \(\{AW\}\) where \(A\) has \(k=\text{poly}(n)\) fractional bits. Moreover, since \(W\) is integral, we can ignore the integral part of \(A\). Denoting \(A=.b_{1}b_{2}\dots b_{k}\) and \(W=\sum_{j=0}^{\ell-1}x_{j}2^{j}\) with \(\ell\leq8n^{2}\), we have
\begin{eqnarray*}
\{AW\} & = & \{\sum_{i=1}^{k}\sum_{j=0}^{\ell-1}b_{i}x_{j}2^{j-i}\}\\
& = & \{\sum_{i=1}^{k}\sum_{j=0}^{\min\{\ell-1,i-1\}}b_{i}x_{j}2^{j-i}\}
\end{eqnarray*}
since we can discard those terms \(b_{i}x_{j}2^{j-i}\) which are integers. Denoting \(S(x)=\sum_{i=1}^{k}\sum_{j=0}^{\min\{\ell-1,i-1\}}b_{i}x_{j}2^{j-i}\), we observe that \(S(x)\) is an integral multiple of \(2^{-k}\) which lies between \(0\) and \(k\). Let \(\{[a_{u},b_{u}]\}\) denote the collections of intervals in \([0,k]\) which are not within \(1/4\) of an integer. Note that there are \(k\) such intervals, and that each interval \([a,b]\) can be recognized by the PTF \(\left(S(x)-\frac{a+b}{2}\right)^{2}\leq\left(\frac{b-a}{2}\right)^{2}\). Hence, the negation of such an interval can be recognized by \(\left(S(x)-\frac{a+b}{2}\right)^{2}>\left(\frac{b-a}{2}\right)^{2}\), and by taking the intersection of these \(k\) degree-2 PTFs, we obtain the region of \([0,k]\) which is within \(1/4\) of an integer, which is precisely the region where the decryption system outputs \(0\).
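As a sanity check on this last step, the snippet below (ours) verifies on random points that the intersection of these \(k\) degree-2 PTFs recognizes exactly the set of values within \(1/4\) of an integer, i.e. the region where the decryption outputs \(0\) (here a real number \(s\) stands in for \(S(x)\)).

```python
import numpy as np
rng = np.random.default_rng(11)

def ptf_intersection(s, k):
    """Intersection of the k degree-2 PTFs (s - (a+b)/2)^2 > ((b-a)/2)^2, one per 'bad'
    interval (a, b) = (j + 1/4, j + 3/4) of points NOT within 1/4 of an integer."""
    return all((s - (j + 0.5)) ** 2 > 0.25 ** 2 for j in range(k))

k = 5
for s in rng.uniform(0, k, size=2000):
    frac = abs(s - round(s))                     # distance from s to the nearest integer
    assert ptf_intersection(s, k) == (frac < 0.25)
print("the PTF intersection recognizes exactly the region where decryption outputs 0")
```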

To conclude, we mention some other hardness of learning results in the literature. Assuming that there is no efficient algorithm that never refutes a satisfiable random \(3\)-SAT formula with \(f(n)=n\log^{2}n\) clauses, but refutes most such random formulae, Daniely and Shalev-Shwartz proved that learning the intersection of \(\log^{3}n\) half-spaces is hard. Daniely also proved that agnostically learning a single half-space is hard, assuming that a related computational problem is hard on average. Along the lines of our comments about PAC learning, we note that these results all show that there are some distributions and some functions which are hard to learn. It would be interesting to obtain hardness results for an a priori fixed, natural choice of distribution.