# SSL Survey

We are on the cusp of a major research revolution driven by deep learning. In my view, this revolution has produced two outstanding contributions on the network-architecture side: $\textit{ResNet}$ and $\textit{Transformer}$. As research exploration deepens and, especially, as computational capacity grows, techniques that exploit unlabeled data are attracting more and more attention. There is no doubt that self-supervised learning (SSL) is a direction worth diving into, and a general methodological contribution of this revolution. Therefore, this post surveys the cutting-edge development of SSL from the following aspects: theoretical guarantees, image SSL, sequence SSL, and graph SSL.

## Definition

### Origination

The term “self-supervised learning” was first introduced in robotics, where training data is automatically labeled by leveraging the relations between different input sensor signals. In his invited talk at AAAI 2020, Turing award winner Yann LeCun described SSL as: the machine predicts any part of its input from any observed part:

• Obtain “labels” from the data itself through a “semi-automatic” process;
• Predict part of the data from other parts, where the “other parts” may be incomplete, transformed, distorted, or corrupted. In other words, “predict” can also be read as “recover”. In the extreme case, the hidden/unobserved and visible/observed parts together constitute the whole (original) data.

### Difference between SSL and unsupervised learning

First, supervised learning can be divided into classification and regression according to the label attributes (discrete vs. real-valued, or categorical vs. continuous variables), and further by granularity: image-level, pixel-level, text-level, and token-level. Second, semi-supervised learning aims to incorporate unlabeled data into a supervised algorithm; this is usually implemented by consistency regularization under a distribution/manifold/entropy assumption. Finally, unsupervised learning refers to algorithms that work without external signals (manual labels) guiding the target task. They usually serve as a preprocessing procedure, such as dimensionality reduction (DR), clustering, community discovery, and anomaly detection, and are designed to mine/capture specific data patterns through different heuristic priors or constraints. From these concepts, self-supervised learning (SSL) can be viewed as a branch of unsupervised learning with the following two characteristics: (i) SSL follows the pre-training-then-fine-tuning paradigm and captures general data patterns that are useful for many downstream tasks, especially multi-modal tasks (i.e., multimedia SSL); (ii) the downstream tasks are usually supervised learning algorithms, while SSL concentrates on learning a robust and generalizable representation. From this viewpoint, SSL is equivalent to unsupervised representation learning.

TO-DO

## Generative Methods

There are four major approaches in this kind of SSL: auto-regressive (AR), auto-encoder (AE), flow-based, and hybrid models.

### Auto-regressive (AR)

This series of algorithms models the generation procedure as a sequence. Specifically, they first regard the observed sample as a sequence of variables (a temporal series), then maximize the likelihood of the observed sequence. During the computation of the likelihood, the joint distribution of the sequence is decomposed into the product of the marginal distribution of the first variable and the conditional distributions of the rest, by applying Bayes' formula step by step (i.e., the chain rule of probability), which can be written as:

\begin{align}
\theta^{\star} &= \arg\max_{\theta} \log \prod_{i=1}^{N} p(x_i \mid \theta) \\
&= \arg\max_{\theta} \log \prod_{i=1}^{N} p(x_i^1, \cdots, x_i^t, \cdots, x_i^T \mid \theta) \\
&= \arg\max_{\theta} \log \prod_{i=1}^{N} \prod_{t=1}^{T} p(x_i^t \mid x_i^1, \cdots, x_i^{t-1}, \theta) \\
&= \arg\max_{\theta} \sum_{i=1}^{N} \sum_{t=1}^{T} \log p(x_i^t \mid x_i^1, \cdots, x_i^{t-1}, \theta)
\end{align}

where $\log p(x_i^t \mid x_i^1, \cdots, x_i^{t-1}, \theta)$ is the log-probability of the $t$-th variable of the $i$-th observed sample, conditioned on the previous variables ($1, \cdots, t-1$) of the same sample. Note that although the decomposition changes with the choice of the first variable, the AR model has a unique one because of the temporal dependency. In addition, this objective function is also employed to optimize language models (LMs), where it is reported as perplexity (the exponential of the average negative log-likelihood). Both the perplexity loss in LMs and the cross-entropy loss in classification can be converted into maximum likelihood estimation (MLE).
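As a concrete instance of this decomposition, the sketch below scores a toy sequence under a hypothetical first-order (bigram) AR model, where the conditional $p(x^t \mid x^1, \cdots, x^{t-1})$ is truncated to depend only on $x^{t-1}$. The probability tables are made-up illustration values, not learned parameters:

```python
import numpy as np

# Toy first-order (bigram) AR model over a 3-symbol vocabulary.
prior = np.array([0.5, 0.3, 0.2])        # p(x^1)
cond = np.array([[0.7, 0.2, 0.1],        # row: previous symbol
                 [0.1, 0.8, 0.1],        # column: next symbol
                 [0.3, 0.3, 0.4]])       # each row sums to 1

def ar_log_likelihood(seq):
    """log p(x) = log p(x^1) + sum_t log p(x^t | x^{t-1})."""
    logp = np.log(prior[seq[0]])
    for prev, cur in zip(seq[:-1], seq[1:]):
        logp += np.log(cond[prev, cur])
    return logp

# Summing conditional log-probs equals the log of the joint probability:
print(ar_log_likelihood([0, 1, 1, 2]))   # == log(0.5 * 0.2 * 0.8 * 0.1)
```

Maximizing this quantity over the model parameters (`prior`, `cond`) is exactly the MLE objective above.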

Different SSL AR models vary in the sequential modeling and utilization of outputs (self-supervision):

| Model | Fields | Sequential Modeling | Self-supervision | Comments |
| --- | --- | --- | --- | --- |
| GPT/GPT-2 | NLP | Trivial (left-to-right token order) | Next token prediction | Similarities: Transformer decoder, masked self-attention. Differences: GPT (pre-training + fine-tuning), GPT-2 (in-context learning, given both the inputs and the task) |
| PixelCNN | CV | Top-left to bottom-right by a masked filter/kernel | Next pixel prediction (NPP) | Embed PixelCNN as the decoder of an auto-encoder to perform SSL |
| PixelRNN | CV | Top-left to bottom-right; R channel to G channel | Next pixel-channel prediction (NPCP) | Convolution → long/short-term memory modeling; neighbor strategy (many $h_{t-1}^0$ to consider); Row LSTM (3 pixels in the last row), Diagonal BiLSTM (4/8-adjacency) |
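The masked filter/kernel that keeps PixelCNN's top-left-to-bottom-right ordering causal can be sketched as follows. This is a minimal NumPy construction of the mask only, not the full architecture; the type-A/type-B naming follows the PixelCNN paper, while the helper name is mine:

```python
import numpy as np

def causal_mask(k, mask_type="A"):
    """k x k raster-order mask: a type-A mask zeroes the centre position and
    everything after it; a type-B mask keeps the centre (used in deeper layers)."""
    mask = np.ones((k, k))
    c = k // 2
    mask[c, c + (mask_type == "B"):] = 0.0  # centre row: zero from centre (A) or after it (B)
    mask[c + 1:, :] = 0.0                   # all rows below the centre see "future" pixels
    return mask

# The masked kernel is applied as `weights * mask` before convolving,
# so the prediction for a pixel never depends on that pixel or later ones.
print(causal_mask(3, "A"))
```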

### Flow-based Model (FM) or Probabilistic Generative Model (PGM)

Yoshua Bengio proposed the idea that a good representation is one in which the data has a distribution that is easy to model. In flow-based models, this goodness is characterized by two attributes: an easy-to-compute Jacobian determinant and an easy-to-compute inverse. Different SSL models vary in their definition of such a good distribution.

| Model | Fields | Definition of a Good Distribution | Self-supervision | Comments |
| --- | --- | --- | --- | --- |
| NICE | CV | Easy-to-compute determinant;<br>easy-to-invert | Image reconstruction | A mapping $z = f(x)$ from sample $x$ to a factorizable latent space with independent components: $p(z) = \prod\limits_{d} p(z_d)$.<br>The change-of-variables rule (an ICA-style basis transformation) gives $p(x) = p(z) \left\vert \det \frac{\partial f(x)}{\partial x} \right\vert$, or equivalently $p(z) = p(x) \left\vert \det \frac{\partial f^{-1}(z)}{\partial z} \right\vert$.<br>The deterministic and explicit $f$ is based on a partition of the input dimensions $[1, D]$ into $I_1, I_2$ with $d = \vert I_1 \vert$:<br>$z_{I_1} = x_{I_1}$, $z_{I_2} = x_{I_2} + \text{mlp}(x_{I_1})$<br>Objective: $f^{\star} = \arg\max \sum\limits_{x} \log p(x) = \arg\max \sum\limits_{x} \left( \log p\left( f(x) \right) + \log \left\vert \det \frac{\partial f(x)}{\partial x} \right\vert \right)$ |
| RealNVP | CV | Like NICE, with a non-volume-preserving transformation (non-isometry) | Image reconstruction | Applies the same change-of-variables rule with another decomposition:<br>$z_{I_1} = x_{I_1}$, $z_{I_2} = x_{I_2} \odot \exp \left( \text{scale}(x_{I_1}) \right) + \text{translate}(x_{I_1})$<br>which still has an easy-to-compute determinant and inverse |
| Glow | CV | Like NICE, with an invertible $1\times1$ convolution | Image reconstruction | Convolution as transformation: $\log \left\vert \det \frac{\partial \text{conv1x1}(h, W)}{\partial h} \right\vert = h \cdot w \cdot \log \left\vert \det W \right\vert$, where $h \in \mathbb{R}^{c \times h \times w}$ and $W \in \mathbb{R}^{c \times c}$ are the input feature maps and the weight matrix of the $1\times1$ channel-mixing convolutional kernel, respectively |
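To make NICE's "easy-to-invert, unit-determinant" additive coupling concrete, here is a minimal sketch; the coupling network $\text{mlp}(x_{I_1})$ is replaced by a fixed `tanh` purely for illustration, and the function names are mine:

```python
import numpy as np

def m(u):
    """Stand-in for the coupling network mlp(x_{I1}); any function works,
    because invertibility never requires inverting m itself."""
    return np.tanh(u)

def forward(x, d):
    """z_{I1} = x_{I1}; z_{I2} = x_{I2} + m(x_{I1}).
    The Jacobian is triangular with unit diagonal, so |det| = 1."""
    z = x.copy()
    z[d:] = x[d:] + m(x[:d])
    return z

def inverse(z, d):
    """Exact inverse: x_{I2} = z_{I2} - m(z_{I1}), since z_{I1} = x_{I1}."""
    x = z.copy()
    x[d:] = z[d:] - m(z[:d])
    return x

x = np.array([0.5, -1.0, 2.0, 0.3])
print(np.allclose(inverse(forward(x, 2), 2), x))  # exact reconstruction
```

RealNVP's affine coupling follows the same pattern, with a scale term whose log-determinant is simply the sum of the scales.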

### Auto-encoder (AE)

AE originated from the Restricted Boltzmann Machine (RBM); the statistical motivation and main principles of AE have been introduced and reviewed in depth in my earlier post. Here we review some AE-based SSL models, distinguishing them by field of study and self-supervision.

| Model | Fields | Self-supervision | Comments |
| --- | --- | --- | --- |
| word2vec, FastText | NLP | CBOW: context tokens → input token (pairwise);<br>Skip-gram: input token → context tokens (pairwise) | Key difference between word2vec and FastText: FastText uses not the word token itself but the sum of its character n-grams as the word representation |
| DeepWalk, LINE | Graph | Skip-gram on vertex sequences | DeepWalk: Skip-gram predicts the vertices in a sliding window from the representation of the current vertex; CBOW is infeasible due to the large length of random-walk paths, especially on large graphs.<br>LINE samples vertices not from the random-walk path but from neighboring nodes (relatively local structure) |
| VGAE | Graph | | |
| BERT | NLP | | |
| VQ-VAE | CV | | |
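The skip-gram self-supervision shared by word2vec and DeepWalk (which applies it to random-walk vertex sequences instead of sentences) reduces to generating (center, context) pairs from a sliding window. A minimal sketch, with a function name of my own choosing:

```python
def skipgram_pairs(seq, window=2):
    """Emit (center, context) pairs: each token predicts every token
    within `window` positions of it, on either side."""
    pairs = []
    for i, center in enumerate(seq):
        for j in range(max(0, i - window), min(len(seq), i + window + 1)):
            if j != i:
                pairs.append((center, seq[j]))
    return pairs

# Works identically for word tokens or random-walk vertex IDs:
print(skipgram_pairs(["a", "b", "c", "d"], window=1))
# [('a', 'b'), ('b', 'a'), ('b', 'c'), ('c', 'b'), ('c', 'd'), ('d', 'c')]
```

Each pair is then scored by the dot product of the two embeddings and trained with (negative-sampled) softmax, which is what makes the procedure self-supervised: the "labels" are the sequence itself.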

## Contrastive Methods (Instance-based Learning)

### Instance-instance

• PIRL, ExemplarCNN, Non-Parametric ID, SimCLR, MoCo
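These instance-discrimination methods share a contrastive objective of the InfoNCE/NT-Xent form: two augmented views of the same instance form a positive pair, and the other instances in the batch serve as negatives. A minimal NumPy sketch under toy embeddings (real SimCLR/MoCo pipelines add an encoder, projection head, large batches or a momentum queue, and data augmentation, none of which appear here):

```python
import numpy as np

def info_nce(z1, z2, tau=0.5):
    """InfoNCE loss for a batch where row i of z1 and row i of z2
    are two views of the same instance (positives on the diagonal)."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)   # L2-normalize
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau                              # cosine similarity / temperature
    logits -= logits.max(axis=1, keepdims=True)           # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                    # cross-entropy toward the diagonal

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 8))
print(info_nce(z, z + 0.01 * rng.normal(size=(4, 8))))    # near-identical views -> low loss
```

Minimizing this loss pulls the two views of each instance together while pushing apart the views of different instances, which is exactly the instance-level discrimination these methods perform.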