# Practice in Huggingface Transformers

This post is a note summarizing my practice with Pre-trained Language Models (PLMs) using the Huggingface Transformers package. I believe the points summarized here also confuse other newcomers to coding transformer-based NLP models.

# Variational Inference

One of the core problems of modern statistics is to approximate difficult-to-compute (intractable) probability densities, usually conditional ones. There are two mainstream families of methods: **Markov chain Monte Carlo (MCMC) sampling** and **variational inference**. The former approximates the target distribution with a large number of discrete samples, while the latter fits a tractable, simple distribution to the true distribution. Let's understand the latter through the example of the **Variational Auto-encoder (VAE)**.
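To make this concrete, the standard VAE objective decomposes the (intractable) log-evidence into a tractable lower bound plus a KL gap; here $q_\phi(z \mid x)$ is the simple approximating distribution and $p_\theta(z \mid x)$ the true posterior:

$$
\log p_\theta(x) = \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - \mathrm{KL}\left(q_\phi(z \mid x)\,\|\,p(z)\right)}_{\text{ELBO}} + \mathrm{KL}\left(q_\phi(z \mid x)\,\|\,p_\theta(z \mid x)\right)
$$

Since the last KL term is non-negative, maximizing the ELBO over $\phi$ pushes the tractable $q_\phi$ toward the intractable true posterior, which is exactly the "approach the true distribution" idea above.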

# SSL Survey

We are on the cusp of a major research revolution driven by deep learning. In my view, there are two outstanding network-architecture contributions in this revolution: **ResNet** and **Transformer**. As research exploration deepens and, especially, computational capacity grows, techniques that exploit unlabeled data attract more and more attention. There is no doubt that **self-supervised learning (SSL)** is a direction worth diving into and a general methodological contribution of this revolution. Therefore, this post surveys the cutting-edge development of SSL from the following aspects: theoretical guarantees, image SSL, sequence SSL and graph SSL.

# Subjective, Objective, Assumption and Modeling

## Overview

There are two main methods to estimate parameters in statistics: the **frequentist method (Maximum Likelihood Estimation)** and the **Bayesian method (Bayesian Estimation)**. Frankly speaking, we should gain a deep insight into both and build an intuitive understanding of estimation. In this note, I provisionally offer some dimensions along which to explore them: **motivation, theoretical guarantee and algorithm**. I may enrich this in future study. Note that the two methods differ **only** in how they model the unknown quantity; both are statistical methods that aim to estimate the whole distribution from sampled data (in some cases, by making a hypothesis and then verifying it).
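A minimal numeric sketch of the difference, using a coin-flip example (the `Beta(2, 2)` prior is chosen purely for illustration): the MLE is the sample mean, while the Bayesian posterior mean shrinks that estimate toward the prior mean.

```python
# Toy coin-flip data: 7 heads out of 10 tosses.
heads, n = 7, 10

# Frequentist: the MLE of the head probability is the sample mean.
mle = heads / n  # 0.7

# Bayesian: the unknown probability is itself a random variable with a
# Beta(a, b) prior. The Beta prior is conjugate to the Bernoulli
# likelihood, so the posterior is Beta(a + heads, b + n - heads) and the
# posterior mean shrinks the MLE toward the prior mean 0.5.
a, b = 2.0, 2.0
posterior_mean = (a + heads) / (a + b + n)  # 9/14 ≈ 0.643

print(mle, posterior_mean)
```

As the sample size grows, the prior's influence vanishes and the posterior mean converges to the MLE, which matches the claim above that the two methods differ only in how the unknown quantity is modeled.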

# Daily Reading

## A Diversity-Promoting Objective Function for Neural Conversation Models

### Summary & Intuitions

- mutual information between source (message) and target (response)
- lack of theoretical guarantee

### Contributions

- decompose the formula of mutual information:
  - `anti-lm`: penalizes not only high-frequency, generic responses but also grammatical sentences $\rightarrow$ token weights decrease monotonically (early tokens matter most + the LM term dominates later)
  - `bidi-lm`: not searching but reranking (generate grammatical sequences, then re-rank them according to the reversed-probability objective)
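A toy sketch of the `bidi-lm` reranking idea, with made-up log-probabilities (the candidate list, scores, and the helper `mmi_bidi_score` are all hypothetical, not from the paper's code): an N-best list is generated under the forward model $p(T \mid S)$ and re-ranked by the combined objective $\log p(T \mid S) + \lambda \log p(S \mid T)$.

```python
def mmi_bidi_score(logp_t_given_s, logp_s_given_t, lam=0.5):
    """Combined MMI objective used only at reranking time."""
    return logp_t_given_s + lam * logp_s_given_t

# Made-up N-best list: (response, log p(T|S), log p(S|T)).
# The generic response is likely under the forward model but poorly
# predicts the source; the specific one recovers the source well.
nbest = [
    ("i don't know", -1.0, -9.0),
    ("the meeting is at noon", -2.5, -1.5),
]

reranked = sorted(
    nbest, key=lambda c: mmi_bidi_score(c[1], c[2]), reverse=True
)
print(reranked[0][0])  # the specific response wins after reranking
```

Because the backward term is only applied to already-generated complete sequences, grammaticality is preserved, which is exactly why reranking is preferred over searching with the MMI objective directly.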

# Daily Reading

## Tensor2Tensor: One Model To Learn Them All

### Summary & Intuitions

- multi-modality multi-task learning
- modality-specific subnets: typical pipelines
- modality-agnostic body: separable convolution (row conv + depth-wise conv, due to 1-d, 2-d inputs) and attention mechanism (self-attended + Query-presence-attended)
- joint training of tasks with deficient and sufficient data

### Contributions

- engineering considerations of modality subnets
- language input: linear mapping
- language output: linear mapping + `softmax`

- image input: 2 separable conv + 1 pooling + residual link
- categorical output: 3 separable conv + 1 pooling + residual link + 2 separable conv + GAP + linear
- audio input and output: a waveform (1-d) or spectrogram (2-d) is handled the same as the image input above
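The separable convolution in the modality-agnostic body factors a standard convolution into a per-channel (depthwise) spatial step followed by a 1x1 (pointwise) channel-mixing step. A minimal NumPy sketch (naive loops, valid padding, stride 1; shapes and names are my own, not the Tensor2Tensor code):

```python
import numpy as np

def depthwise_separable_conv(x, depth_k, point_w):
    """Depthwise separable convolution on an (H, W, C) input.

    x:       (H, W, C) input feature map
    depth_k: (kh, kw, C) one spatial kernel per input channel
    point_w: (C, C_out) 1x1 pointwise channel-mixing weights
    """
    H, W, C = x.shape
    kh, kw, _ = depth_k.shape
    oh, ow = H - kh + 1, W - kw + 1

    # Depthwise step: each channel is convolved with its own kernel,
    # with no mixing across channels.
    depth_out = np.zeros((oh, ow, C))
    for i in range(oh):
        for j in range(ow):
            patch = x[i:i + kh, j:j + kw, :]              # (kh, kw, C)
            depth_out[i, j] = (patch * depth_k).sum(axis=(0, 1))

    # Pointwise step: a 1x1 conv mixes channels at each position.
    return depth_out @ point_w                             # (oh, ow, C_out)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 3))
out = depthwise_separable_conv(
    x, rng.standard_normal((3, 3, 3)), rng.standard_normal((3, 16))
)
print(out.shape)  # (6, 6, 16)
```

The payoff is parameter count: the depthwise plus pointwise factorization costs `kh*kw*C + C*C_out` weights instead of `kh*kw*C*C_out` for a full convolution.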

# KBQAD Boundary Survey

## Overview

A general framework for the XX task consists of three phases: extraction, understanding and reasoning.