Transcribed by Jiaoyang Huang (edited by Asad Lodhia, Elchanan Mossel and Matthew Brennan)
For today's lecture we will present two papers. The first one is Provable Bounds for Learning Some Deep Representations by Arora, Bhaskara, Ge and Ma. They proved that random sparse neural networks are learnable. The second paper is Train Faster, Generalize Better: Stability of Stochastic Gradient Descent by Hardt, Recht and Singer. Their main result states that any model trained with the stochastic gradient method in a reasonable amount of time attains a small generalization error.
Provable Bounds for Learning Some Deep Representations
The sparse random network model has \(\ell\) layers of vectors of binary variables \(h^{(\ell)}, h^{(\ell-1)},\cdots, h^{(1)}\) (where \(h^{(\ell)}\) is the top layer) and an observed layer \(h^{(0)}\) at the bottom. Each layer has \(n\) nodes.
\begin{align}
\mathrm{Depth}=\ell, \qquad \mathrm{Width}=n.
\end{align}
The activation function is the sgn function, i.e.
\begin{align}
\sigma(x)=\left\{
\begin{array}{cc}
1 & x>0,\\
0 & x=0,\\
-1 & x<0.
\end{array}
\right.
\end{align}
The network between two layers is a random graph with average degree \(d=n^{\gamma}\) with \(0<\gamma\leq 0.2\). More precisely, each node at level \(k\) is connected to each node at level \(k-1\) with probability \(n^{\gamma-1}\) independently. The edge weights are random in \(\{\pm 1\}\) with probability \(1/2\) each. For any vertex \(i\) in the \(k\)-th layer, and vertex \(j\) in the \((k-1)\)-th layer, the weight \(W_{ji}^{k-1}\in \{-1,0,1\}\) is such that
\begin{align}
W_{ji}^{k-1}
=\left\{
\begin{array}{cc}
0 & \text{there is no edge between \(i,j\).}\\
-1 & \text{there is an edge with weight \(-1\) between \(i,j\).}\\
1 & \text{there is an edge with weight \(1\) between \(i,j\).}
\end{array}
\right.
\end{align}
We remark that \(W^{k-1}\) is the weighted adjacency matrix between the \((k-1)\)-th and \(k\)-th layers.
The samples are generated from top to bottom. Given \(h^{(k)}\), the vector in the \(k\)-th layer, the vector in the \((k-1)\)-th layer is \(h^{(k-1)}=\mathrm{sgn}(W^{k-1}h^{(k)})\),
\begin{align}
h^{(k-1)}_j=\mathrm{sgn}\left(\sum_{i=1}^n W^{k-1}_{ji}h^{(k)}_i\right),\quad j=1,2,\cdots, n.
\end{align}
We sample the top layer \(h^{(\ell)}\), where the set of nodes that are \(1\) is picked uniformly among all sets of size \(\rho_\ell n\). \(\rho_\ell\) is called the density of the top layer, and we assume that \(\rho_\ell (d/2)^\ell=O(1)\). Since each node has on average \(d\) children, the density at level \(k\) is roughly of size \(\rho_\ell d^{\ell-k}\ll1\), and \(h^{(k)}\) is also a sparse vector.
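To make the generative model concrete, here is a minimal sketch of the sampling procedure in Python (the function names, such as sample_network and sample_observation, are ours and not from the paper; the weights are stored as dense \(n\times n\) matrices for simplicity).
\begin{verbatim}
import numpy as np

def sample_network(n, d, num_layers, rng):
    """Sample the random weight matrices W^(0), ..., W^(l-1).

    Each entry is nonzero with probability d/n (so the average degree is d),
    and each nonzero weight is +1 or -1 with probability 1/2.
    """
    weights = []
    for _ in range(num_layers):
        mask = rng.random((n, n)) < d / n          # is there an edge?
        signs = rng.choice([-1, 1], size=(n, n))   # random +/-1 weights
        weights.append(mask * signs)               # weights[k] plays the role of W^(k)
    return weights

def sample_observation(weights, rho_top, rng):
    """Pick a sparse top layer h^(l) and propagate it down to h^(0)."""
    n = weights[0].shape[0]
    h = np.zeros(n)
    support = rng.choice(n, size=int(rho_top * n), replace=False)
    h[support] = 1.0                               # top layer: rho_l * n ones
    for W in reversed(weights):                    # apply W^(l-1), ..., W^(0)
        h = np.sign(W @ h)                         # h^(k-1) = sgn(W^(k-1) h^(k))
    return h                                       # the observed layer h^(0)
\end{verbatim}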
The main theorem states that, with high probability over the randomness of the network, the whole network (both the graph structure and the edge weights) can be learned in polynomial time from polynomially many samples.
Thanks to the sparsity, the network approximately preserves information and allows us to easily go back and forth between the representations in two adjacent layers. In the following we first focus on learning a single-layer network, with a top layer vector \(h^{(1)}\) and a bottom layer vector \(h^{(0)}\). We assume that \(h^{(1)}\) is sampled among sparse vectors, and then \(h^{(0)}\) is generated from \(h^{(1)}\). To learn a single-layer network, we first need to recover the network structure; then, for each sample, we need to recover \(h^{(1)}\) from \(h^{(0)}\).
Given a single-layer network (with known graph structure), the property that we can recover \(h^{(1)}\) from \(h^{(0)}\) is captured by the notion of a denoising autoencoder: the network is a denoising autoencoder if, with high probability, \(h^{(1)}\) can be recovered from \(h^{(0)}\) even after \(h^{(0)}\) is corrupted by a small amount of noise.
We will show that a single-layer random network is a denoising autoencoder if the input layer and noise are both sparse.
Let \(S\) be the support of the top layer \(h^{(1)}\), and \(S'\) be the support of the bottom layer \(h^{(0)}\). The key property of the sparse network is that most neighbors of an activated node are unique. More precisely, fix \(i\) in the top layer and \(j\in {\cal N}(i)\) in the bottom layer, where \({\cal N}(i)\) is the set of neighbors of \(i\), i.e. the vertices \(j\) with \(W_{ji}\neq 0\). Thanks to the sparsity assumption, with high probability \(j\) has no neighbor in \(S\) other than \(i\), i.e.
\begin{align}
{\mathbb P}(|{\cal N}(j)\cap S\setminus \{i\}|\geq 1)\leq \rho d<0.1.
\end{align}
If \({\cal N}(j)\cap S\setminus \{i\}=\emptyset\) and \(h_j^{(0)}\) was not flipped by the noise, then \(h_{i}^{(1)}=W_{ji}h_j^{(0)}\). Moreover, this fails only with small probability:
\begin{align}
{\mathbb P}\left({\cal N}(j)\cap S\setminus \{i\}\neq\emptyset {\text{ or \(h_j^{(0)}\) was flipped by the noise}}\right)< 0.2.
\end{align}
Motivated by these observations, we can recover \(h^{(1)}\) using the following quantity,
\begin{align}
I_{i}=\sum_{j\in {\cal N}(i)}W_{ji}h_j^{(0)}.
\end{align}
If \(|I_i|\geq d/2\), then we set \(h_i^{(1)}=\mathrm{sgn}(I_i)\); otherwise, if \(|I_i|< d/2\), we set \(h_i^{(1)}=0\).
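As an illustration, here is a minimal Python sketch of this decoding rule, assuming the weight matrix \(W\) between the two layers is known (the function name decode is ours).
\begin{verbatim}
import numpy as np

def decode(W, h0, d):
    """Recover h^(1) from h^(0) for a single layer with known weights W.

    W has entries W[j, i] in {-1, 0, 1}, h0 is the observed bottom layer.
    Compute I_i = sum_j W[j, i] * h0[j]; set h^(1)_i = sgn(I_i) if |I_i| >= d/2,
    and h^(1)_i = 0 otherwise.
    """
    I = W.T @ h0
    return np.where(np.abs(I) >= d / 2, np.sign(I), 0.0)
\end{verbatim}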
In the following we focus on learning the structure of a single-layer network. We recall that \(h^{(1)}\) is sampled among sparse vectors (with density \(\rho\)), and then \(h^{(0)}\) is generated from \(h^{(1)}\). The graph structure can be recovered using \(\mathrm{poly}(n)\) samples via a correlation test. Write \(i_1\sim i_2\) for two nodes in the bottom layer if they have a common neighbor (a common parent) in the top layer. If \(i_1\sim i_2\), then whenever the common parent is activated both \(h_{i_1}^{(0)}\) and \(h_{i_2}^{(0)}\) are typically non-zero, so
\begin{align}
\mathbb E\left[h_{i_1}^{(0)}h_{i_2}^{(0)}\right]\approx \rho.
\end{align}
If \(i_1\not\sim i_2\), i.e. they have no common parent, then with good probability at least one of \(h_{i_1}^{(0)}, h_{i_2}^{(0)}\) is zero, so
\begin{align}
\mathbb E\left[h_{i_1}^{(0)}h_{i_2}^{(0)}\right]\approx 0.
\end{align}
Once we have recovered the \(\sim\) relation, recovering the graph structure is straightforward: we can identify each vertex in the top layer with a large cluster in the bottom layer. More precisely, each vertex in the top layer corresponds to a maximal cluster \(A\) in the bottom layer such that \(|A|\geq 0.1d\) and \(i\sim j\) for all \(i,j\in A\). This approach can be made robust to errors by identifying clusters that differ in fewer than \(0.1d\) nodes, i.e. we identify \(A\) and \(A'\) if \(|A\Delta A'|\leq 0.1 d\).
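A rough Python sketch of this correlation-and-clustering step is given below; the threshold \(\rho/2\), the use of the absolute value of the empirical correlation (to ignore the random edge signs), and the greedy clustering are our simplifications, and the paper's actual procedure is more careful.
\begin{verbatim}
import numpy as np

def recover_clusters(samples, rho, d):
    """Group bottom-layer nodes that appear to share a top-layer parent.

    samples: array of shape (num_samples, n); each row is one observed h^(0).
    Returns a list of clusters; each large cluster is interpreted as the
    neighborhood N(i) of one top-layer node i.
    """
    num_samples, n = samples.shape
    corr = samples.T @ samples / num_samples     # empirical E[h_{i1} h_{i2}]
    related = np.abs(corr) > rho / 2             # declare i1 ~ i2
    np.fill_diagonal(related, False)

    clusters, unused = [], set(range(n))
    while unused:
        i = min(unused)
        cluster = {i} | {j for j in unused if related[i, j]}
        unused -= cluster
        if len(cluster) >= 0.1 * d:              # large clusters name a top-layer node
            clusters.append(sorted(cluster))
    return clusters
\end{verbatim}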
The high-level network inference algorithm in the paper learns the network layer by layer, starting from the bottom. For the network between the \(i\)-th and \((i+1)\)-th layers, the algorithm first recovers the network structure between levels \(i\) and \(i+1\), and then recovers \(h^{(i+1)}\) from \(h^{(i)}\) for each sample, as above.
There is a technical issue. In the original model, we sample \(h^{(\ell)}\) uniformly among all sparse vectors, then sequentially generate \(h^{(\ell-1)}, h^{(\ell-2)}, \cdots, h^{(0)}\). The bits of \(h^{(i+1)}\) are not independent! We cannot directly use the correlation argument above to recover the network structure between levels \(i\) and \(i+1\). However, this problem can be resolved by showing that the bits of \(h^{(i+1)}\) cannot be too correlated unless they have a common ancestor. Moreover, the correlation decays exponentially with the distance to the common ancestor, i.e.
\begin{align}
\text{correlation from a common grandparent}\leq \frac{1}{d}\times\text{correlation from a common parent}.
\end{align}
Train Faster, Generalize Better: Stability of Stochastic Gradient Descent
Let us consider the following general setting of supervised learning. There is an unknown distribution \(\cal D\). We receive a sample \(S=(z_1,z_2,\cdots, z_n)\) of \(n\) i.i.d. examples drawn from \(\cal D\). Let \(f(w, z)\) denote a loss function, where \(w\) specifies the model and \(z\) denotes an example. Our goal is to minimize the population risk:
\begin{align}
R[w]:= {\mathbb E}_{z\sim \cal D}[f(w,z)],
\end{align}
over all possible models \(w\). Since the distribution \(\cal D\) is unknown, we cannot measure the quantity \(R[w]\) directly; instead, we minimize the empirical risk:
\begin{align}
R_S[w]:=\frac{1}{n}\sum_{i=1}^n f(w,z_i).
\end{align}
The generalization error of a model \(w\) is the difference
\begin{align}
R_S[w]-R[w].
\end{align}
If \(w=A(S)\) is chosen as a function of the data by a possibly random algorithm \(A\), we need to estimate the expected generalization error
\begin{align}
\varepsilon_{gen}:=\mathbb E_{S,A}[R_S[A(S)]-R[A(S)]].
\end{align}
In order to bound the generalization error, we recall the definition of uniform stability: a randomized algorithm \(A\) is \(\varepsilon\)-uniformly stable if, for all data sets \(S\) and \(S'\) of size \(n\) that differ in at most one example,
\begin{align}
\sup_{z}\mathbb E_{A}[f(A(S),z)-f(A(S'),z)]\leq \varepsilon.
\end{align}
We define \(\varepsilon_{Stab}(A,n)\) to be the smallest such \(\varepsilon\). Uniform stability implies generalization in expectation:
\begin{align}
|\mathbb E_{S,A}[R_S[A(S)]-R[A(S)]]|\leq \varepsilon_{Stab}(A,n).
\end{align}
To see this, let \(S'=(z_1',z_2',\cdots,z_n')\) be another sample of \(n\) i.i.d. examples drawn from \(\cal D\), independent of \(S\), and let \(S^{(i)}\) be the data set obtained from \(S\) by replacing the \(i\)-th example \(z_i\) with \(z_i'\). Then
\begin{align}\begin{split}
\mathbb E_{S}\mathbb E_{A}[R_S[A(S)]]
&=\mathbb E_{S}\mathbb E_{A}\left[\frac{1}{n}\sum_{i=1}^n f(A(S),z_i)\right]\\
&=\mathbb E_{S}\mathbb E_{S'}\mathbb E_{A}\left[\frac{1}{n}\sum_{i=1}^n f(A(S^{(i)}),z_i')\right]\\
&=\mathbb E_{S}\mathbb E_{S'}\mathbb E_{A}\left[\frac{1}{n}\sum_{i=1}^n f(A(S),z_i')\right]+\delta\\
&=\mathbb E_{S}\mathbb E_{A}\left[R[A(S)]\right]+\delta,
\end{split}\end{align}
where
\begin{align}
|\delta|&=\left|\mathbb E_{S}\mathbb E_{S'}\mathbb E_{A}\left[\frac{1}{n}\sum_{i=1}^n \left(f(A(S^{(i)}),z_i')-f(A(S),z_i')\right)\right]\right| \\
&\leq \varepsilon_{Stab}(A,n).
\end{align}
This finishes the proof.
In the following we show that for stochastic gradient methods we can control the uniform stability and thereby obtain an upper bound on the generalization error.
Given the loss function \(f(w,z)\), the stochastic gradient update with learning rate \(\alpha_t>0\) is given by
\begin{align}
w_{t+1}=w_t-\alpha_t \nabla_w f(w_t, z_{i_t})
\end{align}
where the indices \(i_t\) are either picked uniformly at random from \(\{1,2,\cdots,n\}\), or picked according to a random permutation.
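In code, the stochastic gradient method is a short loop; here is a generic sketch in Python, where grad_f stands for \(\nabla_w f(w,z)\) and is an input we assume is supplied by the user (it is not an interface from the paper).
\begin{verbatim}
import numpy as np

def sgm(grad_f, data, w0, step_sizes, seed=0):
    """Stochastic gradient method: w_{t+1} = w_t - alpha_t * grad_f(w_t, z_{i_t})."""
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    n = len(data)
    for alpha in step_sizes:        # one update per step size alpha_t
        i = rng.integers(n)         # i_t picked uniformly at random
        w = w - alpha * grad_f(w, data[i])
    return w
\end{verbatim}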
In the rest of the lecture, we analyze the stochastic gradient method for convex loss minimization. Let us first record a useful fact about convexity: if \(f\) is convex and \(\nabla f\) is \(\beta\)-Lipschitz, then the gradient is co-coercive,
\begin{align}\label{e:coercive}
\langle \nabla f(v)-\nabla f(w), v-w\rangle \geq \frac{1}{\beta}\|\nabla f(v)-\nabla f(w)\|_2^2.
\end{align}
To prove this, we first show that for any \(x,y\),
\begin{align}\label{e:sconvex1}
f(y)\geq f(x)+\nabla f(x)^T(y-x)+\frac{1}{2\beta}\|\nabla f(y)-\nabla f(x)\|^2_2.
\end{align}
Then by symmetry, we have
\begin{align}\label{e:sconvex2}
f(x)\geq f(y)+\nabla f(y)^T(x-y)+\frac{1}{2\beta}\|\nabla f(y)-\nabla f(x)\|^2_2.
\end{align}
Our claim \eqref{e:coercive} follows by taking the sum of \eqref{e:sconvex1} and \eqref{e:sconvex2}.
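Explicitly, adding \eqref{e:sconvex1} and \eqref{e:sconvex2} and cancelling \(f(x)+f(y)\) from both sides leaves
\begin{align}
\langle \nabla f(y)-\nabla f(x), y-x\rangle \geq \frac{1}{\beta}\|\nabla f(y)-\nabla f(x)\|_2^2,
\end{align}
which is \eqref{e:coercive} with \(v=y\) and \(w=x\).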
For the proof of \eqref{e:sconvex1}, we introduce a new function
\begin{align}
g(z)=f(z)-\nabla f(x)^T z.
\end{align}
Then \(g(z)\) is convex and \(\nabla g\) is \(\beta\)-Lipschitz. Moreover, by the convexity of \(f\),
\begin{align}
f(z)\geq f(x)+\nabla f(x)^T(z-x),
\end{align}
so \(g(z)=f(z)-\nabla f(x)^Tz\geq f(x)-\nabla f(x)^Tx=g(x)\), i.e. \(g\) achieves its minimum at \(z=x\).
Since \(\nabla g\) is \(\beta\)-Lipschitz, by the Taylor expansion, for any \(z\),
\begin{align}
g(z)\leq g(y)+\nabla g(y)^T(z-y)+\frac{\beta}{2}\|z-y\|_2^2.
\end{align}
By taking the minimum on both sides,
\begin{equation}
\begin{split}
g(x)
&=\min_z\{ g(z)\}\\
&\leq \min_z\left\{g(y)+\nabla g(y)^T(z-y)+\frac{\beta}{2}\|z-y\|_2^2\right\}\\
&=g(y)-\frac{1}{2\beta}\|\nabla g(y)\|_2^2,
\end{split}
\end{equation}
Since \(\nabla g(y)=\nabla f(y)-\nabla f(x)\) and \(g(x)-g(y)=f(x)-f(y)+\nabla f(x)^T(y-x)\), this is exactly \eqref{e:sconvex1}.
We now prove the stability bound for convex loss minimization via the stochastic gradient method. Suppose that for every example \(z\), the loss \(f(\cdot,z)\) is convex and \(L\)-Lipschitz, and that \(\nabla f(\cdot,z)\) is \(\beta\)-Lipschitz. If we run the stochastic gradient method for \(T\) steps with step sizes \(\alpha_t\leq 2/\beta\), then
\begin{align}
\varepsilon_{Stab}\leq \frac{2L^2}{n}\sum_{t=1}^T\alpha_t.
\end{align}
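For example, with a constant step size \(\alpha_t=\alpha\leq 2/\beta\), the bound specializes to
\begin{align}
\varepsilon_{Stab}\leq \frac{2L^2\alpha T}{n},
\end{align}
which grows linearly in the number of steps \(T\) and decays with the sample size \(n\); this is the sense in which training faster (fewer steps) yields better generalization.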
We couple the two algorithms, run on data sets \(S\) and \(S'\) that differ in a single example, by using the same random indices \(i_t\) in both runs. Let \(w_t\) and \(w_t'\) denote the respective iterates and set \(\delta_t=\|w_t-w_t'\|_2\). At step \(t\), there are two cases: with probability \(1-1/n\), the algorithms choose the same data example; with probability \(1/n\), they choose the example in which \(S\) and \(S'\) differ.
If the two algorithms choose the same data example \(z\) at step \(t\), then
\begin{equation}
\begin{split}
\delta_{t+1}^2
=&\|w_t-w_t' -\alpha_t(\nabla f(w_t, z)-\nabla f(w_t',z))\|^2_2\\
=&\|w_t-w_t'\|^2_2 +\alpha_t^2\|\nabla f(w_t, z)-\nabla f(w_t',z)\|^2_2\\
&-2\alpha_t\langle w_t-w_t', \nabla f(w_t, z)-\nabla f(w_t',z)\rangle\\
\leq& \|w_t-w_t'\|^2_2=\delta_t^2,
\end{split}
\end{equation}
provided that \(\alpha_t\leq 2/\beta\), where we used \eqref{e:coercive}.
If the two algorithms choose different data examples \(z\) and \(z'\), then
\begin{equation}
\begin{split}
\delta_{t+1}&=\|w_t-w_t' -\alpha_t(\nabla f(w_t, z)-\nabla f(w_t',z'))\|_2\\
&\leq \|w_t-w_t'\|_2+ \alpha_t\left(\|\nabla f(w_t, z)\|_2+\|\nabla f(w_t',z')\|_2\right)\\
&\leq \delta_t+2\alpha_tL.
\end{split}
\end{equation}
The first case happens with probability \(1-1/n\) and the second case with probability \(1/n\), so on average we have
\begin{equation}
\begin{split}
\mathbb E[\delta_{t+1}]&\leq \left(1-\frac{1}{n}\right)\mathbb E[\delta_t]+
\frac{1}{n}\mathbb E[\delta_t+2\alpha_t L]\\
&=\mathbb E [\delta_t]+\frac{2\alpha_tL}{n}.
\end{split}
\end{equation}
Overall, we have
\begin{align}
\mathbb E[\delta_{T}]\leq\frac{2L}{n}\sum_{t=1}^T\alpha_t,
\end{align}
and, since \(f(\cdot,z)\) is \(L\)-Lipschitz, for every \(z\),
\begin{align}
\mathbb E\left|f(w_T,z)-f(w_T',z)\right|\leq L\, \mathbb E\|w_T-w_T'\|_2\leq \frac{2L^2}{n}\sum_{t=1}^T\alpha_t.
\end{align}
This finishes the proof.
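To illustrate the coupling argument numerically, here is a small self-contained Python sketch (our own toy example, not from the paper): we use the logistic loss, which is convex, Lipschitz and smooth on bounded data, run the stochastic gradient method on two data sets differing in a single example with shared random indices, and track \(\|w_t-w_t'\|_2\).
\begin{verbatim}
import numpy as np

def logistic_grad(w, x, y):
    """Gradient of the logistic loss f(w, (x, y)) = log(1 + exp(-y <w, x>))."""
    return -y * x / (1.0 + np.exp(y * np.dot(w, x)))

def coupled_sgm(X, y, Xp, yp, alpha, T, seed=0):
    """Run SGM on two data sets with the same random indices; return the gaps delta_t."""
    rng = np.random.default_rng(seed)
    n, dim = X.shape
    w, wp = np.zeros(dim), np.zeros(dim)
    gaps = []
    for _ in range(T):
        i = rng.integers(n)                        # same index i_t for both runs
        w = w - alpha * logistic_grad(w, X[i], y[i])
        wp = wp - alpha * logistic_grad(wp, Xp[i], yp[i])
        gaps.append(np.linalg.norm(w - wp))
    return gaps

# Two data sets of size n differing only in example 0.
rng = np.random.default_rng(1)
n, dim = 200, 5
X = rng.normal(size=(n, dim))
y = rng.choice([-1.0, 1.0], size=n)
Xp, yp = X.copy(), y.copy()
Xp[0], yp[0] = rng.normal(size=dim), -y[0]

gaps = coupled_sgm(X, y, Xp, yp, alpha=0.1, T=1000)
print("final parameter gap:", gaps[-1])  # in expectation at most 2*L*alpha*T/n by the theorem above
\end{verbatim}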