Practice in Huggingface Transformers
This post is a note summarizing my practice with Pre-trained Language Models (PLMs) using the Huggingface Transformers package. I believe the points summarized here are also confusing to other newcomers coding transformer-based NLP models.
- input format: an input dictionary with keys consisting of `input_ids`, `attention_mask`, and, for some models, `token_type_ids`.
- text modeling: a single input can be a `str` (a raw text sequence), a `List[str]` (a pre-tokenized sequence, aka whitespace-tokenized sequence), or a `List[int]` (ids already produced by `tokenizer.encode` and requiring padding or truncation).
- batched text modeling: add a batch dimension to the above formats; the ambiguity of `List[str]` (a batch of raw texts vs. a single pre-tokenized sequence) is resolved by the argument `is_split_into_words = False/True`, as in the snippet below.
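A minimal sketch of the accepted input formats, assuming a BERT checkpoint (the checkpoint name is only an example; any tokenizer behaves the same way):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# 1) raw text: a plain `str`
enc = tokenizer("a visit to the museum")

# 2) pre-tokenized text: `List[str]`, flagged explicitly to avoid ambiguity
enc = tokenizer(["a", "visit", "to", "the", "museum"], is_split_into_words=True)

# 3) batched raw text: `List[str]` with the default `is_split_into_words=False`
enc = tokenizer(["first sentence", "second sentence"], padding=True, truncation=True)

# 4) batched pre-tokenized text: `List[List[str]]`
enc = tokenizer([["first", "sentence"], ["second", "sentence"]],
                is_split_into_words=True, padding=True)
```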
- build input: since version 3.0, `tokenizer.__call__` can handle all situations, including a single text, a single pair of texts, a batch of texts, and a batch of pairs of texts (a quick sketch follows below). `tokenizer.encode` can also handle a single text input, but it cannot preprocess batched input, so it is a function with very limited usage scenarios.
- `tokenizer.encode_plus` and `tokenizer.batch_encode_plus` are both deprecated since that version, and we should not assume that they are invoked in the implementation of `tokenizer.__call__` in the future.
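A quick sketch of how `tokenizer.__call__` covers every case while `tokenizer.encode` only produces a flat list of ids (again, the checkpoint is just an example):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# __call__ returns a dict-like BatchEncoding in every situation.
single      = tokenizer("how are you?")
single_pair = tokenizer("how are you?", "i am fine.")
batch       = tokenizer(["how are you?", "nice to meet you"], padding=True)
batch_pairs = tokenizer(["how are you?", "nice to meet you"],
                        ["i am fine.", "nice to meet you too"],
                        padding=True)

# encode only handles one text (or one pair) and returns a flat List[int].
ids = tokenizer.encode("how are you?")
```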
- what is `token_type_ids`?
- It is a mask reflecting which input sequence the current token belongs to.
- only when `return_token_type_ids = True` can we make the tokenization of a text pair clear and useful.
- although `tokenizer.encode` can be fed a `text_pair`, it does not have a parameter dubbed `return_token_type_ids`; it can only be considered as an in-method concatenation plus tokenization.
- this field is model-dependent, and the dependency is consistent with the pre-training strategy of each model's original paper.
- `BERT` is pretrained on MLM and NSP/SOP, so it can handle the `token_type_ids` of a text pair.
- `GPT2` is pretrained on autoregressive generation and can be fine-tuned on conditional generation, which demands that it handle a `text_pair` as one concatenated sequence.
- `BART` is pretrained on MLM using Seq2Seq modeling and targets conditional generation only, which means it cannot handle the `token_type_ids` of a text pair: `BartTokenizer` does not return valid `token_type_ids` when fed a text pair (see the comparison below).
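A sketch comparing how the BERT and BART tokenizers treat `token_type_ids` for a text pair (the checkpoints are chosen only for illustration):

```python
from transformers import AutoTokenizer

bert = AutoTokenizer.from_pretrained("bert-base-uncased")
bart = AutoTokenizer.from_pretrained("facebook/bart-base")

text, pair = "how are you?", "i am fine."

# BERT: segment ids switch from 0 to 1 at the second sentence.
print(bert(text, pair, return_token_type_ids=True)["token_type_ids"])

# encode accepts the pair as well, but only returns the concatenated ids,
# so the segment information is lost.
print(bert.encode(text, pair))

# BART: token_type_ids come back as all zeros, i.e. not a valid segment mask.
print(bart(text, pair, return_token_type_ids=True)["token_type_ids"])
```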
Special Tokens in PLMs
- special tokens are inserted automatically when the argument `add_special_tokens` of `tokenizer.__call__` is set to `True` (the default), as shown below.
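For instance (a minimal sketch with an assumed BERT checkpoint):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

with_special    = tokenizer("how are you?")  # add_special_tokens=True by default
without_special = tokenizer("how are you?", add_special_tokens=False)

print(tokenizer.convert_ids_to_tokens(with_special["input_ids"]))
# ['[CLS]', 'how', 'are', 'you', '?', '[SEP]']
print(tokenizer.convert_ids_to_tokens(without_special["input_ids"]))
# ['how', 'are', 'you', '?']
```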
- there are two groups of special tokens: U-styled for NLU and G-styled for NLG.
- U-styled: `cls_token`, `sep_token`, and `pad_token` for marking the start, end, and padding of a sequence. When a `text_pair` is fed into the tokenizer, a `sep_token` will be appended to the end of each sequence.
- G-styled: `bos_token`, `eos_token`, and `pad_token` for the same purposes as above. When a `text_pair` is fed into the tokenizer, it is usually accompanied by a `prefix` to perform fine-tuning of conditional generation. At that time, the `eos_token` is used to mark the final stop of the whole sequence rather than the end of either part.
- For encoder-decoder architectures, G-styled is the major style. The `eos_token` can also act as the `sep_token` (e.g., `BartTokenizer` maps `sep_token` to `</s>`).
- `BERT`: U-styled - `[CLS]`, `[SEP]`, `[PAD]`
- `GPT2`: G-styled - `<|endoftext|>` without a primitive padding token
- `BART`: G-styled - `<s>`, `</s>`, `<pad>`
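The styles can be checked directly from each tokenizer's `special_tokens_map` (a small sketch; the exact map may vary slightly across versions):

```python
from transformers import AutoTokenizer

for name in ["bert-base-uncased", "gpt2", "facebook/bart-base"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(name, tok.special_tokens_map)

# bert-base-uncased  -> U-styled: '[CLS]', '[SEP]', '[PAD]', '[UNK]', '[MASK]'
# gpt2               -> G-styled: '<|endoftext|>' serves as bos/eos/unk; no pad_token,
#                       so a common workaround is `tok.pad_token = tok.eos_token`
# facebook/bart-base -> G-styled: '<s>', '</s>', '<pad>', with '<s>'/'</s>' reused as cls/sep
```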