# Variational Inference

One of the core problems of modern statistics is approximating difficult-to-compute (intractable) probability densities, usually conditional ones. There are two mainstream families of methods for this: **Markov chain Monte Carlo (MCMC)** and **variational inference**. The former approximates the true distribution with a large number of samples drawn from a suitably constructed Markov chain, while the latter picks a tractable, simple family of distributions and optimizes within that family to approach the true distribution. Let’s build intuition through the example of the **Variational Auto-Encoder (VAE)**.
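To make the variational idea concrete before diving into the VAE, here is a minimal sketch of my own (not from this post): pick a simple family $q = \mathcal{N}(\mu, \sigma)$ and optimize its parameters to minimize $\mathrm{KL}(q \,\|\, p)$. The target $p$ is taken to be Gaussian only so that the answer is checkable; the closed-form Gaussian KL and the use of `scipy.optimize.minimize` are illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize

# Target distribution p (Gaussian here only so the answer is checkable;
# in real problems p is intractable and we optimize a bound instead).
MU_P, SIGMA_P = 2.0, 1.5

def kl_q_p(params):
    """Closed-form KL(q || p) for q = N(mu, sigma), p = N(MU_P, SIGMA_P)."""
    mu, log_sigma = params
    sigma = np.exp(log_sigma)            # parameterize in log-space: sigma > 0
    return (np.log(SIGMA_P / sigma)
            + (sigma**2 + (mu - MU_P)**2) / (2 * SIGMA_P**2)
            - 0.5)

res = minimize(kl_q_p, x0=np.array([0.0, 0.0]))  # fit the variational family
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])   # recovers mu ~ 2.0, sigma ~ 1.5
```

Since $q$ and $p$ live in the same family here, the optimizer drives the KL to zero; in realistic settings the best $q$ only approximates $p$, and the gap is exactly what the ELBO bounds.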

# SSL Survey

We are on the cusp of a major research revolution driven by deep learning. In my view, there are two outstanding architectural contributions in this revolution: **ResNet** and the **Transformer**. As research exploration deepens and, especially, as computational capacity grows, techniques that exploit unlabeled data are attracting more and more attention. **Self-supervised learning (SSL)** is undoubtedly a direction worth diving into and a general methodological contribution of this revolution. Therefore, this post surveys the cutting-edge development of SSL from the following aspects: theoretical guarantees, image SSL, sequence SSL, and graph SSL.
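Before the survey proper, a minimal sketch of what “self-supervised” means in the contrastive image-SSL branch: two augmented views of the same image should get similar embeddings, and all other images act as negatives. The NumPy implementation of a SimCLR-style NT-Xent loss below is my own illustrative sketch, not code from any surveyed paper.

```python
import numpy as np

def nt_xent(z, temperature=0.5):
    """SimCLR-style NT-Xent loss. z holds 2N embeddings; rows 2k and 2k+1
    are two augmented views of the same sample (the positive pair)."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity
    sim = (z @ z.T) / temperature
    np.fill_diagonal(sim, -np.inf)                     # exclude self-pairs
    n = z.shape[0]
    pos = np.arange(n) ^ 1                             # partner index of each row
    # Cross-entropy of picking the positive among all other rows.
    log_prob = sim[np.arange(n), pos] - np.log(np.exp(sim).sum(axis=1))
    return -log_prob.mean()
```

No labels appear anywhere: the “supervision” is the known pairing between two views of the same input, which is exactly the pretext-task trick the survey sections below elaborate on.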

# Subjective, Objective, Assumption and Modeling

## Overview

There are two main approaches to parameter estimation in statistics: the **frequentist method (Maximum Likelihood Estimation)** and the **Bayesian method (Bayesian estimation)**. Frankly speaking, we should gain deep insight into them and develop an intuitive understanding of estimation. In this note, I provisionally offer some dimensions along which to explore them: **motivation, theoretical guarantee, and algorithm**. I may enrich this in future study. Note that the two methods **differ only** in how they model the unknown quantity; both are statistical methods that aim to estimate the whole distribution from sampled data (and, in some cases, to form a hypothesis and then verify it).
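As a toy contrast between the two modeling choices (my own example, not from the note): estimate a coin’s head probability $\theta$ from 7 heads in 10 tosses. MLE treats $\theta$ as a fixed unknown and maximizes the likelihood; the Bayesian route treats $\theta$ as a random variable, and with a conjugate Beta prior the posterior stays Beta.

```python
# Data: 7 heads in 10 coin tosses.
heads, n = 7, 10

# Frequentist (MLE): theta is a fixed unknown; maximizing the Bernoulli
# likelihood gives the sample frequency.
theta_mle = heads / n                               # 0.7

# Bayesian: theta is a random variable. With a Beta(a, b) prior (conjugate
# to the Bernoulli likelihood) the posterior is Beta(a + heads, b + tails).
a, b = 1.0, 1.0                                     # uniform prior (a modeling choice)
post_a, post_b = a + heads, b + (n - heads)
theta_post_mean = post_a / (post_a + post_b)        # 8/12, pulled toward the prior
```

The two estimates differ only because of how the unknown is modeled: the point estimate maximizes the data likelihood, while the posterior mean blends the data with the prior and shrinks toward it when data is scarce.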

# Daily Reading

## A Diversity-Promoting Objective Function for Neural Conversation Models

### Summary & Intuitions

- mutual information between source (message) and target (response)
- lack of theoretical guarantee

### Contributions

- decompose the mutual-information objective into two variants:
  - `anti-lm`: penalizes not only high-frequency, generic responses but also grammatical sentences $\rightarrow$ token weights decrease monotonically (early tokens matter most; the LM term dominates later)
  - `bidi-lm`: reranking rather than searching (generate grammatical sequences first, then re-rank them according to the inverse-probability objective)
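A tiny numeric sketch of the reranking variant (the candidate strings, scores, and weight $\lambda$ are invented for illustration): score each N-best candidate $T$ by a weighted combination of the forward model $\log p(T|S)$ and the reverse model $\log p(S|T)$, so that a generic reply that ignores the source is demoted.

```python
import numpy as np

# Hypothetical N-best candidates for one source message S, with made-up
# log-probabilities: log p(T|S) from the forward seq2seq model and
# log p(S|T) from a model trained in the reverse direction.
candidates = ["i don't know", "the match starts at eight", "okay"]
log_p_t_s = np.array([-2.0, -6.0, -2.5])   # forward model favors the generic reply
log_p_s_t = np.array([-9.0, -1.5, -8.0])   # generic replies can't reconstruct S

lam = 0.5
# Re-rank by the combined objective instead of searching with it directly.
score = (1 - lam) * log_p_t_s + lam * log_p_s_t
best = candidates[int(np.argmax(score))]   # the specific reply wins
```

Decoding with the forward model alone would pick “i don't know”; the reverse term penalizes responses that carry no information about the source, which is exactly the diversity-promoting effect.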

# Daily Reading

## Tensor2Tensor: One Model To Learn Them All

### Summary & Intuitions

- multi-modality multi-task learning
- modality-specific subnets: typical pipelines
- modality-agnostic body: separable convolution (row conv + depth-wise conv, due to 1-d, 2-d inputs) and attention mechanism (self-attended + Query-presence-attended)
- joint training of tasks with deficient and sufficient data

### Contributions

- engineering considerations of the modality subnets:
  - language input: linear mapping
  - language output: linear mapping + `softmax`
  - image input: 2 separable conv + 1 pooling + residual link
  - categorical output: 3 separable conv + 1 pooling + residual link + 2 separable conv + GAP + linear
  - audio input and output: waveform (1-d) or spectrogram (2-d), handled the same as the image input above
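The separable convolutions in those subnets factor a full convolution into a per-channel spatial step plus a 1×1 channel-mixing step, cutting parameters and compute. Below is my own minimal NumPy sketch of that factorization (loop-based for clarity, “valid” padding, no bias), not the paper’s implementation.

```python
import numpy as np

def depthwise_separable_conv(x, depthwise_k, pointwise_w):
    """x: (C, H, W); depthwise_k: (C, k, k), one spatial filter per channel;
    pointwise_w: (C_out, C), a 1x1 conv that mixes channels."""
    C, H, W = x.shape
    k = depthwise_k.shape[1]
    Ho, Wo = H - k + 1, W - k + 1
    # Depthwise step: each channel convolved with its own kernel (spatial mixing only).
    dw = np.zeros((C, Ho, Wo))
    for c in range(C):
        for i in range(Ho):
            for j in range(Wo):
                dw[c, i, j] = np.sum(x[c, i:i + k, j:j + k] * depthwise_k[c])
    # Pointwise step: 1x1 conv across channels (channel mixing only).
    return np.einsum('oc,chw->ohw', pointwise_w, dw)
```

A standard conv needs `C_out * C * k * k` weights per layer; this factorization needs only `C * k * k + C_out * C`, which is why it suits a shared, modality-agnostic body.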

# KBQAD Boundary Survey

## Overview

A general framework for the XX task is made up of three phases: extraction, understanding, and reasoning.

# LM Survey

## Overview

My last post was mainly based on the question-generation papers from ACL ’20, but quite a few papers from earlier years also surfaced through the recursive reading process. Indeed, I know little about the history of language models and typical NLP tasks, so let’s step forward by first fixing some notation.

As we know, the basic unit in NLP is the token (or token embedding). Note that a token does not necessarily mean a word, though we can regard a word as a printable token. There are also non-printable tokens, e.g. the classifier tag, the separator tag, the begin/start-of-sequence tag, and the end-of-sequence tag. Printability and non-printability are notions borrowed from the ASCII code (32 non-printable control characters, 95 printable characters, and the non-printable DEL character). By this analogy, ASCII characters, Unicode characters, or any displayable single-width characters correspond to printable tokens.

Besides the token, another core concept is the sequence. A single token scarcely occurs alone; it co-occurs with other tokens under an ordering or temporal dependency, which is also the key challenge of modeling natural language. Therefore, **sequence embeddings** (the list of token embeddings in a sequence) are the content of the data flow in NLP tasks, just as **feature maps** (the collection of multiple channels of image features) are the counterpart in CV tasks. Moreover, just as CV tasks use **cropping and resizing** to obtain well-shaped batch data, we need **truncating or padding** to handle sentences that are too long or too short. Padding tokens, however, would otherwise take part in the computation and contribute gradients during back-propagation, so by convention we feed a mask along with the input text to mark the span to be ignored; in **Transformer**-based architectures this mask carries the somewhat ambiguous and confusing name **attention_mask**.
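As a concrete illustration of truncating/padding and the accompanying mask (a minimal sketch of my own; the `pad_batch` helper and the token ids are made up, not a real tokenizer API):

```python
import numpy as np

PAD = 0  # padding token id (a common but not universal convention)

def pad_batch(token_id_seqs, max_len):
    """Pad (or truncate) variable-length token-id sequences to max_len and
    build the 0/1 mask that marks real tokens vs. padding."""
    ids = np.full((len(token_id_seqs), max_len), PAD, dtype=np.int64)
    mask = np.zeros((len(token_id_seqs), max_len), dtype=np.int64)
    for r, seq in enumerate(token_id_seqs):
        seq = seq[:max_len]                # truncate sequences that are too long
        ids[r, :len(seq)] = seq
        mask[r, :len(seq)] = 1             # 1 = attend to this position, 0 = ignore
    return ids, mask

ids, mask = pad_batch([[5, 8, 2], [7, 2]], max_len=4)
```

The model consumes `ids` as a rectangular batch, while `mask` tells the attention layers (and the loss) which positions are real content and which are padding filler.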

Let’s go through the state-of-the-art models of the time and build a comprehensive understanding of transfer learning in NLP.