# Practice in Huggingface Transformers

This post is a note summarizing my practice with pre-trained language models (PLMs) using the Huggingface Transformers package. I believe the points summarized here are also confusing to other newcomers to coding transformer-based NLP models.

## Tokenization

### API Reference

• Input format: the model input is a dictionary with the keys input_ids, attention_mask, token_type_ids and labels.
• Text modeling: a single text can be a str, a List[str] (a pre-tokenized, i.e. whitespace-tokenized, sequence) or a List[int] (the output of tokenizer.encode, which still requires padding or truncation).
• Batched text modeling: add a batch dimension to the formats above. A List[str] then becomes ambiguous (a batch of raw texts or one pre-tokenized text), and the argument is_split_into_words = False/True resolves the ambiguity.
• Build input: as of version 4.4.2, tokenizer.__call__ handles all situations, including a single text, a single pair of texts, a batch of texts and a batch of pairs of texts.
• tokenizer.encode can also handle a single text input, but it cannot preprocess batched input, so its usage scenarios are very limited.
• As of version 4.4.2, tokenizer.encode_plus and tokenizer.batch_encode_plus are both deprecated, and we should not assume that the implementation of tokenizer.__call__ will keep invoking them in the future.
• What is token_type_ids?
  • It is a mask that indicates which input sequence each token belongs to.
  • Tokenizing a text pair is only clear and useful when return_token_type_ids = True.
  • Although tokenizer.encode accepts text and text_pair, it has no return_token_type_ids parameter and returns only the list of token ids. It can only be regarded as in-method concatenation plus tokenization.
  • This field is model-dependent, and the dependency follows the pre-training strategy of the original paper.
    • BERT is pretrained on MLM and NSP (ALBERT uses SOP in place of NSP), so it can handle the token_type_ids input.
    • GPT2 is pretrained on autoregressive generation and can be fine-tuned on conditional generation, which requires it to handle the token_type_ids input.
    • BART is pretrained on a denoising objective (a generalization of MLM) with Seq2Seq modeling and targets conditional generation only, so it cannot handle the token_type_ids input. Moreover, BartTokenizer does not return valid token_type_ids when fed a text pair.
    • Others: to be done.
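
To make the input dictionary concrete, here is a minimal pure-Python sketch of how a BERT-style text pair could be laid out with padding. It does not call the Transformers library: the helper `encode_pair`, the toy `VOCAB` and its ids are all hypothetical and only mirror the keys described above (labels are left out because they depend on the task).

```python
from typing import Dict, List, Optional

# Toy vocabulary: these ids are made up for illustration, not BERT's real ids.
VOCAB = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2,
         "how": 3, "are": 4, "you": 5, "fine": 6, "thanks": 7}


def encode_pair(text: List[str],
                text_pair: Optional[List[str]] = None,
                max_length: int = 10) -> Dict[str, List[int]]:
    """Lay out [CLS] A [SEP] (B [SEP]) and pad up to max_length."""
    tokens = ["[CLS]"] + text + ["[SEP]"]
    token_type_ids = [0] * len(tokens)  # first sequence, with [CLS] and its [SEP]
    if text_pair is not None:
        tokens += text_pair + ["[SEP]"]
        token_type_ids += [1] * (len(text_pair) + 1)  # second sequence and its [SEP]

    input_ids = [VOCAB[t] for t in tokens]
    attention_mask = [1] * len(input_ids)  # 1 marks real tokens

    n_pad = max_length - len(input_ids)
    if n_pad < 0:
        raise ValueError("sequence is longer than max_length; truncate it first")
    input_ids += [VOCAB["[PAD]"]] * n_pad
    token_type_ids += [0] * n_pad
    attention_mask += [0] * n_pad  # 0 marks padding

    return {"input_ids": input_ids,
            "token_type_ids": token_type_ids,
            "attention_mask": attention_mask}


# The texts are pre-tokenized word lists (the List[str] format).
encoded = encode_pair(["how", "are", "you"], ["fine", "thanks"], max_length=10)
for key, value in encoded.items():
    print(key, value)
```

With a real tokenizer the same three keys come back from tokenizer.__call__; the exact ids and special tokens depend on the checkpoint, so treat the numbers here as placeholders.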

### Special Tokens in PLMs

• add_special_tokens in tokenizer.__call__ defaults to True.
• There are two groups of special tokens: U-styled for NLU and G-styled for NLG.
  • U-styled: cls_token, sep_token and pad_token mark the start, end and padding of a sequence. When text_pair is fed into the tokenizer, sep_token is appended to the end of each sequence.
  • G-styled: bos_token, eos_token and pad_token serve the same purposes as above. When text_pair is fed into the tokenizer, it is usually accompanied by a prompt or prefix in order to fine-tune conditional generation. In that case, eos_token marks the final stop of the whole sequence rather than the end of either text.
  • For encoder-decoder architectures, G-styled is the main style. bos_token and eos_token can also act as cls_token and sep_token, respectively.
• BERT: U-styled - [CLS], [SEP], [PAD]
• GPT2: G-styled - <|endoftext|>, <|endoftext|>, with no native padding token
• BART: G-styled - <s>, </s>, <pad>
• Others: to be done.
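
The two styles can be sketched in plain Python as follows. The helpers `u_styled` and `g_styled` are hypothetical names and do not reproduce any tokenizer's implementation; they only mirror the layouts described in the notes above, so check the actual output of the tokenizer for a given checkpoint.

```python
from typing import List, Optional


def u_styled(text: List[str], text_pair: Optional[List[str]] = None) -> List[str]:
    """U-styled (BERT-like): [CLS] A [SEP], plus B [SEP] for a pair."""
    tokens = ["[CLS]"] + text + ["[SEP]"]
    if text_pair is not None:
        tokens += text_pair + ["[SEP]"]  # sep_token closes each sequence
    return tokens


def g_styled(prompt: List[str], target: List[str],
             bos: str = "<|endoftext|>", eos: str = "<|endoftext|>") -> List[str]:
    """G-styled: bos first, then prompt and target, and one eos at the very end."""
    return [bos] + prompt + target + [eos]


pair_tokens = u_styled(["how", "are", "you"], ["fine", "thanks"])
gen_tokens = g_styled(["translate", ":"], ["hello"])
print(pair_tokens)
print(gen_tokens)
```

In the U-styled layout every sequence gets its own closing sep_token, while in the G-styled layout there is no separator between prompt and target and only the final eos_token marks where generation should stop.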