Literature summary

1. On the information bottleneck theory of deep learning (Saxe 2018)

[Saxe2018]

1.1 Key points of the paper

  • none of the following claims of Tishby ([Tishby2015]) holds in the general case:

    1. deep networks undergo two distinct phases consisting of an initial fitting phase and a subsequent compression phase
    2. the compression phase is causally related to the excellent generalization performance of deep networks
    3. the compression phase occurs due to the diffusion-like behavior of stochastic gradient descent
  • the observed compression depends on the activation function: double-sided saturating nonlinearities like tanh yield a compression phase, whereas linear activation functions and single-sided saturating nonlinearities like ReLU do not.

  • there is no evident causal connection between compression and generalization.

  • the compression phase, when it exists, does not arise from stochasticity in training.

  • when the input domain comprises both task-relevant and task-irrelevant information, the task-irrelevant information is compressed even though the overall information about the input may monotonically increase with training time. This compression happens concurrently with the fitting process rather than during a subsequent compression phase.

1.2 Most important experiments

  1. Tishby’s experiment reconstructed:

    • 7 fully connected hidden layers of width 12-10-7-5-4-3-2
    • trained with stochastic gradient descent to produce a binary classification from a 12-dimensional input
    • 256 randomly selected samples per batch
    • mutual information is calculated by binning the output activations into 30 equal intervals between -1 and 1 (a code sketch of this set-up follows after this list)
    • trained on Tishby’s dataset
    • tanh-activation function
  2. Tishby’s experiment reconstructed with ReLU activation:

    • 7 fully connected hidden layers of width 12-10-7-5-4-3-2
    • trained with stochastic gradient descent to produce a binary classification from a 12-dimensional input
    • 256 randomly selected samples per batch
    • mutual information is calculated by binning the output activations into 30 equal intervals between -1 and 1
    • ReLU-activation function
  3. Tanh-activation function on MNIST:

    • 6 fully connected hidden layers of width 784-1024-20-20-20-10
    • trained with stochastic gradient descent to classify the ten MNIST digit classes from a 784-dimensional input
    • non-parametric kernel density mutual information estimator
    • trained on MNIST dataset
    • tanh-activation function
  4. ReLU-activation function on MNIST:

    • 6 fully connected hidden layers of width 784-1024-20-20-20-10
    • trained with stochastic gradient descent to classify the ten MNIST digit classes from a 784-dimensional input
    • non-parametric kernel density mutual information estimator
    • trained on MNIST dataset
    • ReLU-activation function
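
To make the reconstructed set-up of experiments 1 and 2 concrete, the following is a minimal sketch. Only the layer widths (12-10-7-5-4-3-2), the tanh activations, SGD, the batch size of 256 and the 30-bin discretization of the activations on [-1, 1] come from the description above; the framework (PyTorch), the single-logit output layer, the placeholder data, the learning rate and the number of epochs are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Layer widths from the reconstructed Tishby experiment: 12-dim input,
# hidden widths 10-7-5-4-3-2; the final single-logit output layer is an assumption.
widths = [12, 10, 7, 5, 4, 3, 2]
layers = []
for n_in, n_out in zip(widths[:-1], widths[1:]):
    layers += [nn.Linear(n_in, n_out), nn.Tanh()]
layers.append(nn.Linear(widths[-1], 1))
net = nn.Sequential(*layers)

# Placeholder data: random 12-dim inputs with random binary labels.
X = torch.randn(4096, 12)
y = torch.randint(0, 2, (4096, 1)).float()

opt = torch.optim.SGD(net.parameters(), lr=0.1)      # learning rate is a placeholder
loss_fn = nn.BCEWithLogitsLoss()

for epoch in range(10):                              # number of epochs is a placeholder
    perm = torch.randperm(len(X))
    for i in range(0, len(X), 256):                  # 256 randomly selected samples per batch
        idx = perm[i:i + 256]
        opt.zero_grad()
        loss_fn(net(X[idx]), y[idx]).backward()
        opt.step()

# Bin the tanh activations of the first hidden layer into 30 equal intervals on [-1, 1],
# the discretization used for the mutual-information estimates in experiments 1 and 2.
with torch.no_grad():
    h = nn.Sequential(*layers[:2])(X)                # first hidden layer's activations
edges = torch.linspace(-1.0, 1.0, 31)                # 31 edges define 30 equal intervals
binned = torch.bucketize(h, edges)
```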

2. Estimating mutual information

[Kraskov2004]

2.1 Introduction

  • Kraskov suggests an alternative mutual information estimator that is not based on binning but on k-nearest neighbour distances.

  • Mutual information is often used as a measure of independence between random variables. We note that mutual information is zero if and only if two random variables are strictly independent.

  • Mutual information has some well-known properties and advantages because of its close ties to Shannon entropy (see the appendix of the paper); nevertheless, estimating mutual information from finite samples is not always easy.

  • Most mutual information estimation techniques are based on binning, which often leads to a systematic error.

  • Consider a set of \(N\) bivariate measurements, \(z_i = (x_i, y_i), i = 1,...,N\), which are assumed to be iid (independent identically distributed) realizations of a random variable \(Z=(X,Y)\) with density \(\mu (x,y)\). \(x\) and \(y\) can be scalars or elements of a higher dimensional space.

  • For simplicity we adopt the convention \(0 \cdot \log(0) = 0\), so that probability density functions do not have to be strictly positive.

  • The marginal densities of \(X\) and \(Y\) can be denoted as follows:

    \[\mu_x(x) = \int \mu (x,y) dy \ \text{and } \ \mu_y(y) = \int \mu (x,y) dx.\]
  • Therefore we can define mutual information as

    \[I(x,y) = \int_Y \int_X \mu (x,y) \cdot \log \dfrac{\mu (x,y)}{\mu_x (x) \mu_y(y)} dx dy.\]
  • Note that the base of the logarithm sets the unit in which information is measured: base 2 gives bits. In the following we take the natural logarithm, so mutual information is measured in nats (a small numerical illustration of the definition follows after this list).

  • Our aim is to estimate mutual information without any knowledge of the probability density functions \(\mu\), \(\mu_x\) and \(\mu_y\). The only information available is the set \(\{ z_i \}\).
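
As a small numerical illustration of the definition (an addition to this summary, not taken from the paper), the discrete analogue of the integral above can be evaluated directly when the joint distribution is known; base-2 logarithms give the result in bits.

```python
import numpy as np

def mutual_information_bits(p_xy):
    """Discrete analogue of I(X,Y) for a known joint probability table, in bits."""
    p_x = p_xy.sum(axis=1, keepdims=True)    # marginal of X
    p_y = p_xy.sum(axis=0, keepdims=True)    # marginal of Y
    nz = p_xy > 0                            # convention: 0 * log(0) = 0
    return np.sum(p_xy[nz] * np.log2(p_xy[nz] / (p_x * p_y)[nz]))

# Independent binary variables: I(X,Y) = 0.
print(mutual_information_bits(np.array([[0.25, 0.25],
                                        [0.25, 0.25]])))   # 0.0

# Perfectly dependent binary variables: I(X,Y) = 1 bit.
print(mutual_information_bits(np.array([[0.5, 0.0],
                                        [0.0, 0.5]])))     # 1.0
```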

2.2 Binning

  • Binning is a frequently used technique for estimating mutual information. To do so, we partition the supports of \(X\) and \(Y\) into bins of finite size and consider the finite sum:

    \[I(X,Y) \approx I_{\text{binned}} (X,Y) \equiv \sum_{i,j} p(i,j) \log \dfrac{p(i,j)}{p_x(i)p_y(j)},\]

    where \(p_x(i) = \int_i \mu_x (x) dx\), \(p_y(j) = \int_j \mu_y(y) dy\) and \(p(i,j) = \int_i \int_j \mu (x,y) dx dy\) (where \(\int_i\) denotes the integral over bin \(i\)).

  • Let \(n_x(i)\) be the number of points falling into bin \(i\) of \(X\) and, analogously, \(n_y(j)\) the number of points falling into bin \(j\) of \(Y\); \(n(i,j)\) is the number of points falling into the intersection of both bins.

  • Since we do not know the exact probability density functions, we approximate them by \(p_x(i) \approx \frac{n_x(i)}{N}\), \(p_y(j) \approx \frac{n_y(j)}{N}\), and \(p(i,j) \approx \frac{n(i,j)}{N}\).

  • For \(N \rightarrow \infty\) and bin sizes tending to zero, the binning approximation \(I_{\text{binned}}\) indeed converges to \(I(X,Y)\), provided that all densities exist as proper functions.

  • Note that the bin sizes do not have to be the same for all bins; adaptive bin sizes actually lead to much better estimates. A minimal plug-in implementation with equal-width bins is sketched below.
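
A plug-in implementation of \(I_{\text{binned}}\) with equal-width bins could look as follows (a sketch under the approximations above, not the exact estimator used in any of the papers); `np.histogram2d` provides the joint counts \(n(i,j)\), from which \(n_x(i)\) and \(n_y(j)\) follow by summation. The correlated-Gaussian test case is an illustrative assumption.

```python
import numpy as np

def binned_mutual_information(x, y, bins=30):
    """Plug-in estimate of I(X,Y) in nats from scalar samples, using equal-width bins."""
    n_ij, _, _ = np.histogram2d(x, y, bins=bins)   # joint counts n(i, j)
    p_ij = n_ij / n_ij.sum()                       # p(i, j)  ~ n(i, j) / N
    p_i = p_ij.sum(axis=1, keepdims=True)          # p_x(i)   ~ n_x(i) / N
    p_j = p_ij.sum(axis=0, keepdims=True)          # p_y(j)   ~ n_y(j) / N
    nz = p_ij > 0                                  # convention: 0 * log(0) = 0
    return np.sum(p_ij[nz] * np.log(p_ij[nz] / (p_i * p_j)[nz]))

# Correlated Gaussians with rho = 0.5; the true value is -0.5 * log(1 - 0.25) ~ 0.144 nats.
rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)
y = 0.5 * x + np.sqrt(0.75) * rng.standard_normal(100_000)
print(binned_mutual_information(x, y))
```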

2.3 Kraskov estimator

  • The Kraskov estimator uses k-nearest neighbour statistics to estimate mutual information.

  • The basic idea is to estimate \(H(X)\) from the average distance to the \(k\)-th nearest neighbour, averaged over all \(x_i\).

  • Since mutual information between two random variables can also be written as

    \[I(X,Y) = H(X) + H(Y) - H(X,Y),\]

    with \(H(X)= - \int \mu (x) \log \mu (x) dx\) being the Shannon entropy, we could estimate the mutual information by estimating \(H(X)\), \(H(Y)\) and \(H(X,Y)\) separately. However, the errors made in these individual estimates would presumably not cancel. Therefore, we proceed a bit differently:

  • Assume some metrics to be given on the spaces spanned by \(X\), \(Y\) and \(Z=(X,Y)\).

  • For each point \(z_i=(x_i,y_i)\) we rank its neighbours by distance \(d_{i,j} = ||z_i - z_j||: d_{i,j_1} \leq d_{i,j_2} \leq d_{i,j_3} \leq ...\). Similar rankings can be done in the subspaces \(X\) and \(Y\).

  • Furthermore, we will use the maximum norm for the distances in the space \(Z=(X,Y)\), i.e.

    \[||z-z'||_{\max} = \max \{ ||x - x'||, ||y - y'||\},\]

    while any norms can be used for \(||x - x'||\) and \(||y - y'||\).

  • We introduce further notation: \(\frac{\epsilon (i)}{2}\) is the distance between \(z_i\) and its \(k\)-th neighbour, while \(\frac{\epsilon_x (i)}{2}\) and \(\frac{\epsilon_y (i)}{2}\) denote the distances between the same points projected into the \(X\) and \(Y\) subspaces.

  • Note that \(\epsilon(i)=\max \lbrace \epsilon_x (i), \epsilon_y (i)\rbrace\).

  • In the following, two algorithms for estimating mutual information are considered:

    • In the first algorithm, the number of points \(x_j\) whose distance from \(x_i\) is strictly less than \(\frac{\epsilon (i)}{2}\) is counted and denoted \(n_x(i)\); \(n_y(i)\) is defined analogously for \(y\).

    • By \(<...>\) we denote the average over all \(i \in \{1,...,N\}\) and over all realisations of the random samples:

      \[<...> = \dfrac{1}{N} \sum_{i=1}^N E[...(i)].\]
    • The mutual information can then be estimated with:

      \[I^{(1)}(X,Y) = \psi (k) - <\psi (n_x + 1) + \psi (n_y + 1)> + \psi (N).\]
    • In the second algorithm \(n_x(i)\) and \(n_y(i)\) are replaced by the number of points that satisfy the following equations:

      \[||x_i - x_j|| \leq \dfrac{\epsilon_x (i)}{2} \ \text{and} \ ||y_i - y_j|| \leq \dfrac{\epsilon_y (i)}{2}\]
    • Then mutual information can be estimated via

      \[I^{(2)}(X,Y) = \psi (k) - 1/k - <\psi (n_x) + \psi (n_y)> + \psi (N).\]
  • Generally, both estimators give similar results, but \(I^{(1)}\) tends to have slightly smaller statistical errors and larger systematic errors. This means that for very high-dimensional data, \(I^{(2)}\) should be preferred. A brute-force sketch of \(I^{(1)}\) is given below.
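
The sketch below implements the first estimator \(I^{(1)}\) for scalar samples following the formulas above (brute force, quadratic in \(N\), so only suitable for small sample sizes; efficient implementations use k-d trees for the neighbour searches). The correlated-Gaussian test case is an illustrative assumption.

```python
import numpy as np
from scipy.special import digamma

def ksg_mi_1(x, y, k=3):
    """Kraskov estimator I^(1)(X,Y) in nats for scalar samples (brute force)."""
    N = len(x)
    dx = np.abs(x[:, None] - x[None, :])              # pairwise distances in X
    dy = np.abs(y[:, None] - y[None, :])              # pairwise distances in Y
    dz = np.maximum(dx, dy)                           # maximum norm in Z = (X, Y)
    np.fill_diagonal(dz, np.inf)                      # exclude the point itself
    eps_half = np.sort(dz, axis=1)[:, k - 1]          # eps(i)/2: distance to the k-th neighbour
    n_x = np.sum(dx < eps_half[:, None], axis=1) - 1  # points strictly closer in X (minus self)
    n_y = np.sum(dy < eps_half[:, None], axis=1) - 1  # points strictly closer in Y (minus self)
    return digamma(k) - np.mean(digamma(n_x + 1) + digamma(n_y + 1)) + digamma(N)

# Correlated Gaussians with rho = 0.5; the true value is -0.5 * log(1 - 0.25) ~ 0.144 nats.
rng = np.random.default_rng(0)
x = rng.standard_normal(2000)
y = 0.5 * x + np.sqrt(0.75) * rng.standard_normal(2000)
print(ksg_mi_1(x, y))
```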

3. SVCCA: singular vector canonical correlation analysis

[Raghu2017]

3.1 Key points of the paper

  • They developed a method that analyses each neuron’s activation vector (i.e. the vector of scalar outputs it emits on the input data points). This analysis gives insight into learning dynamics and learned representations.

  • SVCCA is a general method that compares two learned representations from different neural network layers and architectures. It can compare the same layer at different training steps, or entirely different layers.

  • The comparison of two representations fulfills two important properties:

    • It is invariant to affine transformations (which allows comparisons across different layers and networks).
    • It is fast to compute, which allows more comparisons to be calculated than with previous methods.

3.2 Experiment set-up

  • Dataset: mostly CIFAR-10 (augmented with random translations)
  • Architecture: One convolutional network and one residual network
  • For some illustrative figures, they additionally designed a toy regression task (training a fully connected network with four hidden layers, 1D input and 4D output)

3.3 How SVCCA works

  • SVCCA is short for Singular Vector Canonical Correlation Analysis and therefore combines the Singular Value Decomposition with a Canonical Correlation Analysis.

  • The representation of a neuron is defined as the function that maps the inputs to that neuron’s scalar outputs; it is therefore studied as a set of responses over a finite set of inputs. Formally, given a dataset \(X = \{x_1,...,x_m\}\) and a neuron \(i\) on layer \(l\), we define \(z^{l}_{i}\) to be the vector of outputs on \(X\), i.e.

    \[z^{l}_{i} = (z^{l}_{i}(x_1), \ldots, z^{l}_{i}(x_m)).\]

    Note that \(z^{l}_{i}\) is a single neuron’s response over the entire dataset and not an entire layer’s response for a single input. In this sense the neuron can be thought of as a single vector in a high-dimensional space. A layer is therefore a subspace of \(\mathbb{R}^m\) spanned by its neurons’ vectors.

  1. Input: takes two (not necessarily different) sets of neurons (typically layers of a network)

    \[l_1 = \{z^{l_1}_{1}, \ldots, z^{l_1}_{m_1}\} \text{ and } l_2 = \{z^{l_2}_{1}, \ldots, z^{l_2}_{m_2}\}\]
  2. Step 1: Use the SVD of each subspace to obtain sub-subspaces \(l_1' \subset l_1\) and \(l_2' \subset l_2\), which contain the most important directions of the original subspaces \(l_1, l_2\).

  3. Step 2: Compute Canonical Correlation similarity of \(l_1', l_2'\): linearly transform \(l_1', l_2'\) to be as aligned as possible and compute correlation coefficients.

  4. Output: pairs of aligned directions \((\widetilde{z}_{i}^{l_1}, \widetilde{z}_{i}^{l_2})\) and how well they correlate, \(\rho_i\) (a numerical sketch of these steps follows after this list). The SVCCA similarity is defined as

    \[\bar{\rho} = \frac{1}{\min(m_1,m_2)} \sum_i \rho_i .\]
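
A numerical sketch of these steps for two layers given as activation matrices of shape (number of inputs \(m\)) × (number of neurons), using plain NumPy (an illustration of the procedure, not the reference implementation). The 0.99 variance threshold used to pick the SVD sub-subspaces is an assumption.

```python
import numpy as np

def svcca_similarity(L1, L2, var_threshold=0.99):
    """SVCCA similarity of two layers given as (m, num_neurons) activation matrices."""
    def top_directions(L):
        # Step 1: SVD; keep enough directions to explain var_threshold of the variance.
        L = L - L.mean(axis=0)
        U, s, _ = np.linalg.svd(L, full_matrices=False)
        k = np.searchsorted(np.cumsum(s**2) / np.sum(s**2), var_threshold) + 1
        return U[:, :k] * s[:k]

    X, Y = top_directions(L1), top_directions(L2)

    # Step 2: CCA via orthonormal bases; the canonical correlations are the
    # singular values of Qx^T Qy (cosines of the principal angles).
    Qx = np.linalg.svd(X, full_matrices=False)[0]
    Qy = np.linalg.svd(Y, full_matrices=False)[0]
    rho = np.clip(np.linalg.svd(Qx.T @ Qy, compute_uv=False), 0.0, 1.0)

    # Output: mean canonical correlation (the SVCCA similarity).
    return rho.sum() / min(X.shape[1], Y.shape[1])

# Example: two random "layers" observed on the same 1000 inputs; the second is an
# almost-linear function of the first, so the similarity is close to 1.
rng = np.random.default_rng(0)
acts1 = rng.standard_normal((1000, 64))
acts2 = acts1 @ rng.standard_normal((64, 32)) + 0.1 * rng.standard_normal((1000, 32))
print(svcca_similarity(acts1, acts2))
```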

3.4 Results

  • The dimensionality of a layer’s learned representation does not have to equal the number of neurons in the layer.
  • Because of the bottom-up convergence of the learning dynamics, they suggest a computationally more efficient training method, Freeze Training, in which layers are sequentially frozen after a certain number of training steps (see the sketch after this list).
  • For convolutional layers, a computational speed-up is achieved by applying a Discrete Fourier Transform, which makes the relevant matrices block-diagonal.
  • Moreover, SVCCA captures the semantics of different classes, with similar classes having similar sensitivities, and vice versa.
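
The summary above does not spell out how Freeze Training is implemented; a minimal PyTorch sketch of the idea (sequentially switching off gradient updates for the lowest not-yet-frozen layer) could look like this. The architecture, data and freezing schedule are purely illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

# Illustrative network and schedule only; not the architecture from the paper.
net = nn.Sequential(nn.Linear(32, 64), nn.ReLU(),
                    nn.Linear(64, 64), nn.ReLU(),
                    nn.Linear(64, 10))
blocks = [net[0], net[2], net[4]]            # layers in bottom-up order
freeze_at = {20: 0, 40: 1}                   # epoch -> index of the block to freeze

opt = torch.optim.SGD(net.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
X, y = torch.randn(1024, 32), torch.randint(0, 10, (1024,))   # placeholder data

for epoch in range(60):
    if epoch in freeze_at:
        # Freeze Training: stop updating the earliest layer that is not yet frozen.
        for p in blocks[freeze_at[epoch]].parameters():
            p.requires_grad_(False)
    opt.zero_grad()
    loss_fn(net(X), y).backward()
    opt.step()
```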

[Saxe2018] Andrew Michael Saxe, Yamini Bansal, Joel Dapello, Madhu Advani, Artemy Kolchinsky, Brendan Daniel Tracey, and David Daniel Cox. On the information bottleneck theory of deep learning. In International Conference on Learning Representations, 2018. URL: https://openreview.net/forum?id=ry_WPG-A-.
[Kraskov2004] Alexander Kraskov, Harald Stögbauer, and Peter Grassberger. Estimating mutual information. Physical Review E, 69:066138, June 2004. arXiv:cond-mat/0305641, doi:10.1103/PhysRevE.69.066138.
[Raghu2017] M. Raghu, J. Gilmer, J. Yosinski, and J. Sohl-Dickstein. SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability. arXiv e-prints, June 2017. arXiv:1706.05806.
[Tishby2015] N. Tishby and N. Zaslavsky. Deep Learning and the Information Bottleneck Principle. arXiv e-prints, March 2015. arXiv:1503.02406.