Mathematics of Deep Learning: Lecture 5 – Random Sparse Nets etc.

Transcribed by Jiaoyang Huang (edited by Asad Lodhia, Elchanan Mossel and Matthew Brennan)

For today’s lecture we will present two papers. The first one is Provable Bounds for Learning Some Deep Representations by Arora, Bhaskara, Ge and Ma. They proved that random sparse neural networks are learnable. The second paper is Train faster, generalize better: Stability of stochastic gradient descent by Hardt, Recht and Singer. Their main result states that any model trained with the stochastic gradient method for a reasonable amount of time attains a small generalization error.

Provable Bounds for Learning Some Deep Representations

The sparse random network model has \(\ell\) layers of vectors of binary variables \(h^{(\ell)}, h^{(\ell-1)},\cdots, h^{(1)}\) (where \(h^{(\ell)}\) is the top layer) and an observed layer \(h^{(0)}\) at the bottom. Each layer has \(n\) nodes.
\begin{align}
\mathrm{Depth}=\ell, \qquad \mathrm{Width}=n.
\end{align}
The activation function is the sign function \(\mathrm{sgn}\), i.e.
\begin{align}
\sigma(x)=\left\{
\begin{array}{cl}
1 & x>0,\\
0 & x=0,\\
-1 & x<0.
\end{array}
\right.
\end{align}
The network between two adjacent layers is a random graph with average degree \(d=n^{\gamma}\), where \(0<\gamma\leq 0.2\). More precisely, each node at level \(k\) is connected to each node at level \(k-1\) independently with probability \(n^{\gamma-1}\). The edge weights are random in \(\{\pm 1\}\), each value taken with probability \(1/2\). For any vertex \(i\) in the \(k\)-th layer and vertex \(j\) in the \((k-1)\)-th layer, the weight \(W_{ji}^{k-1}\in \{-1,0,1\}\) is given by
\begin{align}
W_{ji}^{k-1} =\left\{
\begin{array}{cl}
0 & \text{there is no edge between \(i,j\),}\\
-1 & \text{there is an edge with weight \(-1\) between \(i,j\),}\\
1 & \text{there is an edge with weight \(1\) between \(i,j\).}
\end{array}
\right.
\end{align}
We remark that \(W^{k-1}\) is the weighted adjacency matrix between the \((k-1)\)-th and \(k\)-th layers.

The samples are generated from top to bottom. Given \(h^{(k)}\), the vector in the \(k\)-th layer, the vector in the \((k-1)\)-th layer is \(h^{(k-1)}=\mathrm{sgn}(W^{k-1}h^{(k)})\), i.e.
\begin{align}
h^{(k-1)}_j=\mathrm{sgn}\left(\sum_{i=1}^n W^{k-1}_{ji}h^{(k)}_i\right),\quad j=1,2,\cdots, n.
\end{align}
We sample the top layer \(h^{(\ell)}\) by choosing the set of nodes equal to \(1\) uniformly among all sets of size \(\rho_\ell n\). Here \(\rho_\ell\) is called the density of the top layer, and we assume that \(\rho_\ell (d/2)^\ell=O(1)\). Since, on average, each node has \(d\) children, the density at level \(k\) is roughly of size \(\rho_\ell d^{\ell-k}\ll1\), so \(h^{(k)}\) is also a sparse vector. The main theorem we will prove is the following:

When degree \(d=n^{\gamma}\), for \(0<\gamma\leq 0.2\), with density \(\rho_\ell (d/2)^{\ell}=O(1)\), the network model can be learned with high probability using \(O(\log n/\rho_\ell^2)\) samples and \(O(n^2\ell)\) time.
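To make the generative model concrete, here is a minimal sketch of the sampling procedure in Python (the function names and parameter values are mine, chosen only for illustration, and are not from the paper):

```python
import numpy as np

def sample_network(n, ell, gamma, rng):
    """Draw the random sparse weight matrices W^0, ..., W^{ell-1}."""
    p = n ** (gamma - 1.0)                      # edge probability; average degree d = n^gamma
    weights = []
    for _ in range(ell):
        mask = rng.random((n, n)) < p           # which edges are present
        signs = rng.choice([-1, 1], size=(n, n))
        weights.append(mask * signs)            # entries in {-1, 0, 1}
    return weights

def sample_observation(weights, rho_ell, rng):
    """Sample h^(ell) with rho_ell * n ones, then propagate down with sgn."""
    n = weights[0].shape[0]
    h = np.zeros(n, dtype=int)
    support = rng.choice(n, size=int(rho_ell * n), replace=False)
    h[support] = 1                              # top layer h^(ell)
    layers = [h]
    for W in reversed(weights):                 # W^{ell-1}, then W^{ell-2}, ..., W^0
        h = np.sign(W @ h).astype(int)          # h^(k-1) = sgn(W^{k-1} h^(k))
        layers.append(h)
    return layers                               # [h^(ell), h^(ell-1), ..., h^(0)]

rng = np.random.default_rng(0)
Ws = sample_network(n=1000, ell=3, gamma=0.2, rng=rng)
sample = sample_observation(Ws, rho_ell=0.002, rng=rng)
```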

Thanks to the sparsity, the net approximately preserves information, which makes it easy to go back and forth between the representations in two adjacent layers. In the following we first focus on learning a single-layer network. It has a top layer with vector \(h^{(1)}\) and a bottom layer with vector \(h^{(0)}\). We assume that \(h^{(1)}\) is sampled among sparse vectors, and then \(h^{(0)}\) is generated from \(h^{(1)}\). To learn a single-layer network, we first need to recover the network structure. Then, for each sample, we need to recover \(h^{(1)}\) from \(h^{(0)}\).

Given a single-layer network (with known graph structure), the property that we can recover \(h^{(1)}\) from \(h^{(0)}\) is characterized by the following definition:

Fix an activation function. An autoencoder consists of a decoding function \(D(h)=s(Wh+b)\) and an encoding function \(E(h)=s(W’h+b’)\), where \(W,W’\) are linear transformations, \(b,b’\) are fixed vectors and \(s\) is the activation function, acting on each coordinate. The autoencoder is said to be denoising if \(E(D(h)\oplus \xi)=h\) with high probability with respect to the noise \(\xi\) and the randomness of \(h\), where \(D(h)\oplus\xi\) means that \(D(h)\) is corrupted by the noise \(\xi\). The autoencoder is called weight tying if \(W’=W^T\).

We will show that a single-layer random network is a denoising autoencoder if the input layer and noise are both sparse.

A single-layer network with density \(\rho\) satisfying \(\rho d<0.1\) is a denoising autoencoder with high probability, with respect to noise that flips every output bit independently with probability less than \(0.1\). Furthermore, this autoencoder is weight tying.
We denote by \(W\) the weighted adjacency matrix between the top layer and the bottom layer. By definition, the single-layer network implements the encoder \(E(y)=\textrm{sgn}(W^Ty)\). We will show that the single-layer network is an autoencoder with decoder \(D(h)=\textrm{sgn}(Wh)\) with high probability. Note that this will also imply that it is weight tying.

Let \(S\) be the support of the top layer \(h^{(1)}\), and \(S’\) be the support of the bottom layer \(h^{(0)}\). The key property of the sparse network is that most neighbors of an activated node are unique to it. More precisely, fix \(i\) in the top layer and \(j\in {\cal N}(i)\) in the bottom layer, where \({\cal N}(i)\) is the set of neighbors of \(i\), i.e. the vertices \(j\) with \(W_{ji}\neq 0\). Thanks to the sparsity assumption, with high probability \(j\) has no non-zero neighbor other than \(i\):
\begin{align}
{\mathbb P}(|{\cal N}(j)\cap S\setminus \{i\}|\geq 1)\leq \rho d<0.1.
\end{align}
If \({\cal N}(j)\cap S\setminus \{i\}=\emptyset\) and \(h_j^{(0)}\) was not flipped by the noise, then \(h_{i}^{(1)}=W_{ji}h_j^{(0)}\). Moreover, by a union bound,
\begin{align}
{\mathbb P}\left({\cal N}(j)\cap S\setminus \{i\}\neq\emptyset \text{ or } h_j^{(0)} \text{ was flipped by the noise}\right)< 0.2.
\end{align}
Motivated by these observations, we can recover \(h^{(1)}\) using the quantity
\begin{align}
I_{i}=\sum_{j\in {\cal N}(i)}W_{ji}h_j^{(0)}.
\end{align}
If \(|I_i|\geq d/2\), we set \(h_i^{(1)}=\mathrm{sgn}(I_i)\); otherwise, if \(|I_i| < d/2\), we set \(h_i^{(1)}=0\).
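As an illustration, here is a minimal sketch of this decoding rule and of the decoder \(D\) (the function names are mine; only the rule \(|I_i|\geq d/2\) is from the lecture):

```python
import numpy as np

def decode_up(W, h_bottom, d):
    """Recover the top-layer vector from a (possibly noisy) bottom layer.

    W is the weighted adjacency matrix with entries in {-1, 0, 1}, indexed so
    that W[j, i] is the weight of the edge between top node i and bottom node j;
    d is the average degree."""
    I = W.T @ h_bottom                  # I_i = sum_{j in N(i)} W_{ji} h^(0)_j
    h_top = np.sign(I).astype(int)      # tentative sign
    h_top[np.abs(I) < d / 2] = 0        # keep only the confident coordinates
    return h_top

def decode_down(W, h_top):
    """The decoder D(h) = sgn(W h) of the single-layer network."""
    return np.sign(W @ h_top).astype(int)
```

In the notation of the definition above, `decode_up` plays the role of the encoder \(E\) (with the threshold \(d/2\) replacing the bias term), and `decode_down` plays the role of the decoder \(D\).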

In the following we focus on learning the structure of a single-layer network. We recall that \(h^{(1)}\) is sampled among sparse vectors (with density \(\rho\)), and then \(h^{(0)}\) is generated using \(h^{(1)}\). The graph structure can be recovered using \(\mathrm{poly}(n)\) samples.

Given \(\mathrm{poly}(n)\) samples, one can recover the weighted adjacency matrix \(W\) between layers \(0\) and \(1\).
We define a relation \(\sim\) among vertices in the bottom layer: two nodes \(i_1,i_2\) at the bottom layer are called related, denoted by \(i_1\sim i_2\), if they are connected to the same node at the top layer. We can identify related nodes by their correlations. If \(i_1\sim i_2\), then they have common neighbors. With probability \(\rho\), one of their common neighbors is non-zero, so
\begin{align}
\mathbb E\left[h_{i_1}^{(0)}h_{i_2}^{(0)}\right]\approx \rho.
\end{align}
If \(i_1\not\sim i_2\), then they have no common neighbor, and with good probability at least one of them is zero, so
\begin{align}
\mathbb E\left[h_{i_1}^{(0)}h_{i_2}^{(0)}\right]\approx 0.
\end{align}
Once we have recovered the \(\sim\) relation, recovering the graph structure is straightforward. In fact, we can identify the vertices in the top layer with the maximal clusters in the bottom layer. More precisely, each vertex in the top layer corresponds to a maximal cluster \(A\) in the bottom layer such that \(|A|\geq 0.1d\) and \(i\sim j\) for all \(i,j\in A\). This approach can be strengthened to tolerate some errors by identifying clusters that differ in fewer than \(0.1d\) nodes, i.e. we identify \(A\) and \(A’\) if \(|A\Delta A’|\leq 0.1 d\).
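A minimal sketch of this correlation test, assuming `samples` is an array whose rows are i.i.d. draws of \(h^{(0)}\) (the threshold \(\rho/2\) is an arbitrary illustrative choice, not the constant used in the paper):

```python
import numpy as np

def related_pairs(samples, rho):
    """Estimate the ~ relation among bottom-layer nodes from samples.

    samples: array of shape (num_samples, n), each row one draw of h^(0).
    Returns a boolean n x n matrix R with R[i1, i2] = True when the empirical
    correlation E[h_{i1} h_{i2}] is large, i.e. when we declare i1 ~ i2."""
    corr = samples.T @ samples / samples.shape[0]   # empirical E[h_{i1} h_{i2}]
    related = corr > rho / 2                        # ~ rho if related, ~ 0 otherwise
    np.fill_diagonal(related, False)
    return related
```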

The high-level network inference algorithm in the paper learns the network layer by layer, starting from the bottom. For the network between the \(i\)-th and \((i+1)\)-th layers, the algorithm first recovers the network structure between levels \(i\) and \(i+1\), and then recovers \(h^{(i+1)}\) from \(h^{(i)}\) for each sample, as above.

There is a technical issue. In the original model, we sample \(h^{(\ell)}\) uniformly among all sparse vectors, and then sequentially generate \(h^{(\ell-1)}, h^{(\ell-2)}, \cdots, h^{(0)}\). The bits of \(h^{(i+1)}\) are not independent! We cannot directly use the Lemma above to recover the network structure between levels \(i\) and \(i+1\). However, this problem can be resolved by showing that the bits of \(h^{(i+1)}\) cannot be too correlated unless they have a common ancestor, and that the correlation decays exponentially with the distance to that ancestor, i.e.
\begin{align}
\text{grandparent correlation}\leq \frac{1}{d}\times\text{parent correlation}.
\end{align}

Train Faster, Generalize Better: Stability of Stochastic Gradient Descent

Let’s consider the following general setting of supervised learning. There is an unknown distribution \(\cal D\). We receive a sample \(S=(z_1,z_2,\cdots, z_n)\) of \(n\) i.i.d. examples drawn from \(\cal D\). Let \(f(w, z)\) denote the loss function, where \(w\) specifies the model and \(z\) denotes an example. Our goal is to minimize the population risk:
\begin{align}
R[w]:= {\mathbb E}_{z\sim \cal D}[f(w,z)],
\end{align}
over all possible models \(w\). Since the distribution \(\cal D\) is unknown, we cannot measure the quantity \(R[w]\) directly; instead, we minimize the empirical risk:
\begin{align}
R_S[w]:=\frac{1}{n}\sum_{i=1}^n f(w,z_i).
\end{align}
The generalization error of a model \(w\) is the difference
\begin{align}
R_S[w]-R[w].
\end{align}
If \(w=A(S)\) is chosen as a function of the data by a possibly random algorithm \(A\), we need to estimate the expected generalization error
\begin{align}
\varepsilon_{gen}:=\mathbb E_{S,A}[R_S[A(S)]-R[A(S)]].
\end{align}
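As a toy illustration of these definitions (not from the lecture), one can estimate \(\varepsilon_{gen}\) by Monte Carlo for a simple algorithm; here \(A(S)\) is the empirical mean under the squared loss \(f(w,z)=(w-z)^2\) with \(\mathcal D=N(0,1)\), for which a direct computation gives \(\varepsilon_{gen}=-2/n\):

```python
import numpy as np

rng = np.random.default_rng(3)
n, trials = 20, 20000

gaps = []
for _ in range(trials):
    S = rng.standard_normal(n)           # sample of n i.i.d. examples from D = N(0,1)
    w = S.mean()                         # A(S): empirical risk minimizer of (w - z)^2
    emp_risk = np.mean((w - S) ** 2)     # R_S[A(S)]
    pop_risk = w ** 2 + 1.0              # R[A(S)] = E_z[(w - z)^2] = w^2 + Var(z)
    gaps.append(emp_risk - pop_risk)

print(f"estimated eps_gen = {np.mean(gaps):.4f}  (exact value: {-2.0 / n:.4f})")
```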

In order to bound the generalization error, we recall the following definition of uniform stability:

The algorithm \(A\) is \(\varepsilon\)-uniformly stable if for all data sets \(S, S’\) that differ in at most one example:
\begin{align}
\sup_{z}\mathbb E_{A}[f(A(S),z)-f(A(S’),z)]\leq \varepsilon.
\end{align}
We define \(\varepsilon_{Stab}(A,n)\) to be the smallest such \(\varepsilon\).
The generalization error can be controlled by \(\varepsilon_{Stab}(A,n)\):
\begin{align}
|\mathbb E_{S,A}[R_S[A(S)]-R[A(S)]]|\leq \varepsilon_{Stab}(A,n).
\end{align}
Denote by \(S=(z_1,z_2,\cdots, z_n)\) and \(S’=(z_1’,z_2’,\cdots,z_n’)\) two independent random samples and let \(S^{(i)}=(z_1,\cdots, z_{i-1}, z_i’,z_{i+1},\cdots,z_n)\) be the sample obtained from \(S\) by replacing the \(i\)-th example with \(z_i’\). With this notation, we have
\begin{align}\begin{split}
\mathbb E_{S}\mathbb E_{A}[R_S[A(S)]]
&=\mathbb E_{S}\mathbb E_{A}\left[\frac{1}{n}\sum_{i=1}^n f(A(S),z_i)\right]\\
&=\mathbb E_{S}\mathbb E_{S’}\mathbb E_{A}\left[\frac{1}{n}\sum_{i=1}^n f(A(S^{(i)}),z_i’)\right]\\
&=\mathbb E_{S}\mathbb E_{S’}\mathbb E_{A}\left[\frac{1}{n}\sum_{i=1}^n f(A(S),z_i’)\right]+\delta\\
&=\mathbb E_{S}\mathbb E_{A}\left[R[A(S)]\right]+\delta,
\end{split}\end{align}
where the second equality holds because \((S^{(i)},z_i’)\) has the same joint distribution as \((S,z_i)\) (we simply swap the roles of \(z_i\) and \(z_i’\)), and where
\begin{align}
|\delta|&=\left|\mathbb E_{S}\mathbb E_{S’}\mathbb E_{A}\left[\frac{1}{n}\sum_{i=1}^n \left(f(A(S^{(i)}),z_i’)-f(A(S),z_i’)\right)\right]\right| \\
&\leq \varepsilon_{Stab}(A,n).
\end{align}
This finishes the proof.

In the following we show that for stochastic gradient iterative algorithms, we can control their uniform stability, and obtain some upper bound on the generalization error.

Given the loss function \(f(w,z)\), the stochastic gradient update with learning rate \(\alpha_t>0\) is given by
\begin{align}
w_{t+1}=w_t-\alpha_t \nabla_w f(w_t, z_{i_t})
\end{align}
where the indices \(i_t\) are either picked uniformly at random from \(\{1,2,\cdots,n\}\), or picked according to a random permutation.
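A minimal sketch of this update rule (the loss gradient is a placeholder argument; only the choice of \(i_t\) and the update itself follow the description above):

```python
import numpy as np

def sgd(grad_f, data, w0, alphas, rng, permute=False):
    """Run the stochastic gradient method.

    grad_f(w, z): gradient of the loss f(w, z) with respect to w.
    data:         list of examples z_1, ..., z_n.
    alphas:       sequence of step sizes alpha_t, one per iteration.
    permute:      if True, visit the examples along a random permutation;
                  otherwise pick i_t uniformly at random at each step."""
    w = np.asarray(w0, dtype=float)
    n = len(data)
    order = rng.permutation(n) if permute else None
    for t, alpha in enumerate(alphas):
        i_t = order[t % n] if permute else rng.integers(n)
        w = w - alpha * grad_f(w, data[i_t])   # w_{t+1} = w_t - alpha_t * grad f(w_t, z_{i_t})
    return w
```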

In the rest of the lecture, we analyze the stochastic gradient method for convex loss minimization. Let us first discuss a few consequences of convexity.

If \(f\) is convex and \(\nabla f\) is \(\beta\)-Lipschitz then
\begin{align}\label{e:coercive}
\langle \nabla f(v)-\nabla f(w), v-w\rangle \geq \frac{1}{\beta}\|\nabla f(v)-\nabla f(w)\|_2^2.
\end{align}
First we prove that
\begin{align}\label{e:sconvex1}
f(y)\geq f(x)+\nabla f(x)^T(y-x)+\frac{1}{2\beta}\|\nabla f(y)-\nabla f(x)\|^2_2.
\end{align}
Then by symmetry, we have
\begin{align}\label{e:sconvex2}
f(x)\geq f(y)+\nabla f(y)^T(x-y)+\frac{1}{2\beta}\|\nabla f(y)-\nabla f(x)\|^2_2.
\end{align}
Our claim \eqref{e:coercive} follows by taking sum of \eqref{e:sconvex1} and \eqref{e:sconvex2}.

For the proof of \eqref{e:sconvex1}, we introduce a new function
\begin{align}
g(z)=f(z)-\nabla f(x)^T z.
\end{align}
Then \(g(z)\) is convex and \(\nabla g\) is \(\beta\)-Lipschitz. Moreover, thanks to the convexity of \(f\), i.e.
\begin{align}
f(z)\geq f(x)+\nabla f(x)^T(z-x),
\end{align}
\(g\) achieves its minimum at \(z=x\).

Since \(\nabla g\) is \(\beta\)-Lipschitz, by the Taylor expansion, for any \(z\),
\begin{align}
g(z)\leq g(y)+\nabla g(y)^T(z-y)+\frac{\beta}{2}\|z-y\|_2^2.
\end{align}
By taking the minimum on both sides,
\begin{equation}
\begin{split}
g(x)
&=\min_z\{ g(z)\}\\
&\leq \min_z\left\{g(y)+\nabla g(y)^T(z-y)+\frac{\beta}{2}\|z-y\|_2^2\right\}\\
&=g(y)-\frac{1}{2\beta}\|\nabla g(y)\|_2^2,
\end{split}
\end{equation}
which is \eqref{e:sconvex1}.
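As a quick numerical sanity check (not part of the lecture), one can verify \eqref{e:coercive} on a convex quadratic, where \(\beta\) is the largest eigenvalue of the Hessian:

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((5, 5))
A = M.T @ M                                 # positive semidefinite Hessian
beta = np.linalg.eigvalsh(A).max()          # Lipschitz constant of the gradient

def grad(w):                                # f(w) = 0.5 * w^T A w, so grad f(w) = A w
    return A @ w

v, w = rng.standard_normal(5), rng.standard_normal(5)
lhs = (grad(v) - grad(w)) @ (v - w)
rhs = np.linalg.norm(grad(v) - grad(w)) ** 2 / beta
assert lhs >= rhs - 1e-12                   # the co-coercivity inequality (e:coercive)
print(lhs, rhs)
```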

We now prove the stability bound for convex loss minimization via the stochastic gradient method.

Assume that for all \(z\), \(f(\cdot, z)\) is convex and \(L\)-Lipschitz, and \(\nabla f\) is \(\beta\)-Lipschitz. Suppose that we run the stochastic gradient method for \(T\) steps with step sizes \(\alpha_t\leq 2/\beta\), then
\begin{align}
\varepsilon_{Stab}\leq \frac{2L^2}{n}\sum_{t=1}^T\alpha_t.
\end{align}
Let \(S\) and \(S’\) be two samples of size \(n\) differing in only a single example. Let \(\{w_t\}\) and \(\{w_t’\}\) be two runs of the algorithm with data sets \(S\) and \(S’\), and \(\delta_t=\|w_t-w_t’\|_2\).

We couple the two runs of the algorithm on the data sets \(S\) and \(S’\) so that they pick the same index \(i_t\) at every step. At step \(t\), there are two cases: with probability \(1-1/n\), the chosen index is one where \(S\) and \(S’\) agree, so the two runs use the same data example; with probability \(1/n\), it is the index at which they differ, so they use different data examples.

If the two algorithms choose the same data example \(z\) at step \(t\), then
\begin{equation}
\begin{split}
\delta_{t+1}^2
=&\|w_t-w_t’ -\alpha_t(\nabla f(w_t, z)-\nabla f(w’_t,z))\|^2_2\\
=&\|w_t-w_t’\|^2_2 +\alpha_t^2\|\nabla f(w_t, z)-\nabla f(w’_t,z)\|^2_2\\
&-2\alpha_t\langle w_t-w_t’, \nabla f(w_t, z)-\nabla f(w’_t,z)\rangle\\
\leq& \|w_t-w_t’\|^2_2=\delta_t^2,
\end{split}
\end{equation}
provided that \(\alpha_t\leq 2/\beta\), where we used \eqref{e:coercive} to bound the inner product from below by \(\frac{1}{\beta}\|\nabla f(w_t, z)-\nabla f(w’_t,z)\|_2^2\).

If the two algorithms choose different data examples \(z\) and \(z’\), then
\begin{equation}
\begin{split}
\delta_{t+1}&=\|w_t-w_t’ -\alpha_t(\nabla f(w_t, z)-\nabla f(w’_t,z’))\|_2\\
&\leq \|w_t-w_t’\|_2+ \alpha_t\left(\|\nabla f(w_t, z)\|_2+\|\nabla f(w’_t,z’)\|_2\right)\\
&\leq \delta_t+2\alpha_tL,
\end{split}
\end{equation}
where the last inequality uses that \(f(\cdot,z)\) is \(L\)-Lipschitz, so its gradients are bounded in norm by \(L\).

Since the first case happens with probability \(1-1/n\) and the second case happens with probability \(1/n\), on average we have
\begin{equation}
\begin{split}
\mathbb E[\delta_{t+1}]&\leq \left(1-\frac{1}{n}\right)\mathbb E[\delta_t]+
\frac{1}{n}\mathbb E[\delta_t+2\alpha_t L]\\
&=\mathbb E [\delta_t]+\frac{2\alpha_tL}{n}.
\end{split}
\end{equation}
Overall, we have
\begin{align}
\mathbb E[\delta_{T}]\leq\frac{2L}{n}\sum_{t=1}^T\alpha_t,
\end{align}
and
\begin{align}
\mathbb E\left[|f(w_T,z)-f(w’_T,z)|\right]\leq L\,\mathbb E\left[\|w_T-w_T’\|_2\right]\leq \frac{2L^2}{n}\sum_{t=1}^T\alpha_t.
\end{align}
This finishes the proof.
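To illustrate the coupling argument numerically (this is not from the lecture), the following sketch runs the two coupled trajectories on data sets differing in a single example, using the logistic loss, which is convex, \(L\)-Lipschitz and \(\beta\)-smooth, and compares \(\delta_T=\|w_T-w_T’\|_2\) with the bound \(\frac{2L}{n}\sum_{t=1}^T\alpha_t\) on \(\mathbb E[\delta_T]\) (a single run may of course deviate from a bound on the expectation):

```python
import numpy as np

rng = np.random.default_rng(2)
n, dim, T = 100, 5, 500

# Logistic loss f(w, (x, y)) = log(1 + exp(-y <w, x>)): convex, L-Lipschitz in w
# with L = max_i ||x_i||_2, and its gradient is beta-Lipschitz with beta = L^2 / 4.
X = rng.standard_normal((n, dim))
y = rng.choice([-1.0, 1.0], size=n)
L = np.linalg.norm(X, axis=1).max()
beta = L ** 2 / 4

def grad(w, x, label):
    return -label * x / (1.0 + np.exp(label * (w @ x)))

# S and S' differ only in example 0.
X2, y2 = X.copy(), y.copy()
X2[0], y2[0] = rng.standard_normal(dim), rng.choice([-1.0, 1.0])

alpha = 1.0 / beta                         # constant step size, <= 2 / beta
w, w2 = np.zeros(dim), np.zeros(dim)
for t in range(T):
    i = rng.integers(n)                    # the coupling: both runs use the same index i_t
    w = w - alpha * grad(w, X[i], y[i])
    w2 = w2 - alpha * grad(w2, X2[i], y2[i])

delta_T = np.linalg.norm(w - w2)
bound = 2 * L / n * alpha * T              # (2L/n) * sum_t alpha_t
print(f"delta_T = {delta_T:.4f}, bound on E[delta_T] = {bound:.4f}")
```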