
Variational Inference

One of the core problems of modern statistics is approximating difficult-to-compute (intractable) probability densities, usually conditional ones. There are two mainstream families of methods for this: Markov chain Monte Carlo (MCMC) sampling and variational inference. The former approximates the target distribution by drawing a large number of samples from it, while the latter posits a simple, tractable distribution and optimizes it to approach the true distribution. Let's build the intuition through the example of the Variational Auto-encoder (VAE).
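
For reference, the objective that variational inference maximizes is the evidence lower bound (ELBO); in generic VAE notation (encoder $q_\phi(z \mid x)$, decoder $p_\theta(x \mid z)$, prior $p(z)$), it reads:

$$\log p_\theta(x) \;\ge\; \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] \;-\; \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big)$$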

Click to read more ...

SSL Survey

We are on the cusp of a major research revolution driven by deep learning. In my view, there are two outstanding architectural contributions in this revolution: $\textit{ResNet}$ and the $\textit{Transformer}$. As research continues to deepen and, especially, as computational capacity increases, techniques that exploit unlabeled data are attracting more and more attention. There is no doubt that self-supervised learning (SSL) is a direction worth diving into and a general methodological contribution of this revolution. Therefore, this post surveys the cutting-edge development of SSL from the following aspects: theoretical guarantees, image SSL, sequence SSL, and graph SSL.

Click to read more ...

Subjective, Objective, Assumption and Modeling

Overview

There are two main methods for estimating parameters in statistics: the frequentist method (Maximum Likelihood Estimation) and the Bayesian method (Bayesian estimation). Frankly speaking, we are supposed to gain a deep insight into them and an intuitive understanding of estimation. In this note, I tentatively offer a few dimensions along which to explore them, namely motivation, theoretical guarantees, and algorithms; I may enrich this in future study. Note that the two methods differ only in how they model the unknown quantity: both are statistical methods that aim to estimate the whole distribution from the sampled data (in some cases, by forming a hypothesis and then verifying it).
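
As a compact reminder of the contrast, in standard notation (data $D$, parameter $\theta$): maximum likelihood treats $\theta$ as a fixed unknown and maximizes the likelihood, while the Bayesian approach treats $\theta$ as a random variable and updates a prior into a posterior:

$$\hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta}\, p(D \mid \theta), \qquad p(\theta \mid D) = \frac{p(D \mid \theta)\,p(\theta)}{\int p(D \mid \theta')\,p(\theta')\,d\theta'}$$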

Click to read more ...

Daily Reading

A Diversity-Promoting Objective Function for Neural Conversation Models

Summary & Intuitions

  • mutual information between source (message) and target (response)
  • lack of theoretical guarantee

Contributions

  • decompose the mutual-information formula into two practical variants (see the formulas after this list):
    • anti-LM: penalizing $\log p(T)$ punishes not only high-frequency, generic responses but also grammatical sentences $\rightarrow$ the penalty weights on tokens decrease monotonically (early tokens matter most; the LM term would dominate later positions)
    • bidi-LM: used for reranking rather than direct search (first generate grammatical candidate sequences, then re-rank them according to the reversed probability $p(S \mid T)$)
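
For reference, and as I recall them from the paper (notation: $S$ = source message, $T$ = target response, $\lambda$ = a tunable weight), the two variants take roughly these forms:

$$\hat{T}_{\text{antiLM}} = \arg\max_{T}\big\{\log p(T \mid S) - \lambda \log p(T)\big\}, \qquad \hat{T}_{\text{bidi}} = \arg\max_{T}\big\{(1-\lambda)\log p(T \mid S) + \lambda \log p(S \mid T)\big\}$$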

Click to read more ...

Daily Reading

Tensor2Tensor: One Model To Learn Them All

Summary & Intuitions

  • multi-modality, multi-task learning
  • modality-specific subnets: typical per-modality pipelines
  • modality-agnostic body: separable convolutions (row conv + depth-wise conv, to handle both 1-D and 2-D inputs) and attention mechanisms (self-attention + query-conditioned attention)
  • joint training of tasks with deficient and with sufficient data

Contributions

  • engineering design of the modality subnets
    • language input: linear mapping
    • language output: linear mapping + softmax
    • image input: 2 separable convs + 1 pooling + residual link (see the sketch after this list)
    • categorical output: 3 separable convs + 1 pooling + residual link + 2 separable convs + GAP + linear
    • audio input and output: waveform (1-D) or spectrogram (2-D), handled the same way as the image input above
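
To make the "separable conv + residual link" building block concrete, here is a minimal PyTorch sketch of a depthwise-separable convolution block for 2-D image input; the class name, layer arrangement, and normalization choice are my own illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class SeparableConvBlock(nn.Module):
    """Depthwise-separable convolution with a residual link (illustrative sketch)."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2
        # Depthwise: one spatial filter per channel (groups=channels).
        self.depthwise = nn.Conv2d(channels, channels, kernel_size,
                                   padding=padding, groups=channels)
        # Pointwise: 1x1 convolution that mixes information across channels.
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1)
        self.norm = nn.BatchNorm2d(channels)
        self.act = nn.ReLU()

    def forward(self, x):
        out = self.act(self.norm(self.pointwise(self.depthwise(x))))
        return x + out  # residual link keeps the input shape

x = torch.randn(2, 32, 64, 64)    # (batch, channels, height, width)
y = SeparableConvBlock(32)(x)     # same shape: (2, 32, 64, 64)
```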

Click to read more ...

LM Survey

Overview

My last post was mainly based on the question generation papers from ACL'20. However, quite a few of the papers I read came from earlier years, because the reading process is recursive. Indeed, I know little about the history of language models and typical NLP tasks, so let's step forward by defining some notation. As we know, the basic unit in NLP is the token, or its token embedding. Note that a token is not necessarily a word, although we can regard a word as a printable token. That means there are also non-printable tokens, e.g. the classifier tag [CLS], the separator tag [SEP], and the begin-of-sequence (BOS) and end-of-sequence (EOS) markers. Printability and non-printability are notions borrowed from ASCII (32 non-printable control characters + 95 printable characters + 1 non-printable DEL). By the way, it is obvious that ASCII characters, Unicode characters, and any other available single-width characters count as printable tokens.

In addition to the token, another core concept is the sequence. A single token scarcely occurs alone; it co-occurs with other tokens under an ordering or temporal dependency, which is also the key challenge of modeling natural language. Therefore, sequence embeddings (the list of token embeddings in a sequence) are the content of the data flow in NLP tasks; feature maps (the collection of multiple channels of image features) are the counterpart in CV tasks. Besides, just as cropping and resizing produce well-shaped batch data in CV tasks, we need truncation and padding to handle overly long or short sentences. Padded positions, however, would otherwise contribute extra, meaningless computation and gradients during back-propagation. Conventionally, we therefore feed a mask along with the input text to mark the span that should be ignored; in Transformer-based architectures this mask goes by the somewhat ambiguous and confusing name attention_mask.
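
To make the truncation/padding/mask mechanics concrete, here is a minimal pure-PyTorch sketch (no tokenizer library assumed); the function name and the pad id of 0 are illustrative choices, not a fixed convention.

```python
import torch

def pad_batch(token_id_seqs, pad_id=0, max_len=None):
    """Pad variable-length token-id sequences to a rectangle and build the
    companion mask: 1 marks real tokens, 0 marks padded positions that the
    model (and the loss) should ignore."""
    max_len = max_len or max(len(s) for s in token_id_seqs)
    input_ids, attention_mask = [], []
    for seq in token_id_seqs:
        seq = list(seq)[:max_len]                 # truncate overlong sequences
        pad = [pad_id] * (max_len - len(seq))     # pad short sequences
        input_ids.append(seq + pad)
        attention_mask.append([1] * len(seq) + [0] * len(pad))
    return torch.tensor(input_ids), torch.tensor(attention_mask)

ids, mask = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
# ids.shape == mask.shape == (2, 5); the first row ends with two padded positions
```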

Let’s go through the state-of-the-art models at the time and have a comprehensive understanding of transfer learning in NLP.

Click to read more ...