Encoder-Decoder Architecture
     The encoder-decoder architecture combines the strengths of encoding input information
into a learned representation and decoding that representation into meaningful results.
The major applications include language translation, image caption generation,
and speech recognition.
BART (Bidirectional and Auto-Regressive Transformers)
     BART is a denoising autoencoder for pretraining sequence-to-sequence models.
It is trained by (1) corrupting text with an arbitrary noising function,
and (2) learning a model to reconstruct the original text.
It uses a standard Transformer-based seq2seq/NMT (Neural Machine Translation) architecture
with an encoder (like BERT, with bidirectional, autoencoder features)
and a decoder (like GPT, with unidirectional, autoregressive features).
This means the encoder's attention mask is fully visible, like BERT,
and the decoder's attention mask is causal, like GPT2.
BART is good at summarization tasks but was trained on text corpora
that are not remotely similar to chat conversations.
For the conversation summarization task, the model can be refined using
conversation datasets and retrained using Seq2SeqTrainer.
Source: BART: Denoising Sequence-to-Sequence Pre-training
for Natural Language Generation, Translation, and Comprehension
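For illustration, here is a minimal sketch of such a refinement run, assuming the Hugging Face transformers and datasets libraries; the samsum dialogue-summarization dataset, the bart-base checkpoint, and the hyperparameters are illustrative assumptions rather than prescriptions:

```python
# Minimal sketch: fine-tuning BART on a dialogue-summarization dataset
# with Seq2SeqTrainer. Dataset and hyperparameters are illustrative.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq,
                          Seq2SeqTrainingArguments, Seq2SeqTrainer)

model_name = "facebook/bart-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

raw = load_dataset("samsum")  # dialogue/summary pairs (assumed dataset choice)

def preprocess(batch):
    # Dialogues become encoder inputs; summaries become decoder labels.
    model_inputs = tokenizer(batch["dialogue"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = raw.map(preprocess, batched=True,
                    remove_columns=raw["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="bart-conversation-summarizer",
    per_device_train_batch_size=4,
    num_train_epochs=1,
    predict_with_generate=True,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```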
Encoder-only Architecture
     Encoder-only models are the language interpreters of
the AI world. They are great at extracting answers to
factoid questions like “Who” and “What” but struggle with open-ended questions
like “Why”.
BERT (Bidirectional Encoder Representations
from Transformers)
     The primary emphasis of BERT is on
understanding input sequences rather than generating output sequences.
Therefore, only the encoder mechanism
is necessary.
     BERT takes in all words in the input sequence at once, so its training is non-directional (often described as bi-directional). This characteristic allows the model to learn the context of a word based on all of its surroundings (to the left and right of the word).
The BERT model undergoes a two-step process:
(a) Pre-training on large amounts of unlabeled text to learn
contextual embeddings.
(b) Fine-tuning on labeled data for specific NLP tasks.
     Source: BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding
(A) Pre-training Phase
     BERT utilizes both Masked LM and
Next Sentence Prediction. When training the BERT model,
Masked LM and Next Sentence Prediction are trained together
to form a core model.
(1) Masked LM (MLM) randomly masks certain tokens
in the input so that the model cannot simply see the tokens it has to predict.
The objective is to predict each masked token based on its surrounding context.
This overcomes the uni-directionality constraint.
(2) Next Sentence Prediction (NSP) receives pairs of sentences
as input and learns to predict whether the second sentence in the pair is
the subsequent sentence in the original document.
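As a minimal sketch, here is what these two objectives look like when probed with an already pre-trained BERT checkpoint, assuming the Hugging Face transformers library; the model name and example sentences are illustrative:

```python
# Probing the two pre-training objectives with a pre-trained BERT checkpoint.
import torch
from transformers import pipeline, BertTokenizer, BertForNextSentencePrediction

# (1) Masked LM: predict the token behind [MASK] from both sides of context.
fill = pipeline("fill-mask", model="bert-base-uncased")
print(fill("The capital of France is [MASK]."))

# (2) Next Sentence Prediction: does sentence B follow sentence A?
tok = BertTokenizer.from_pretrained("bert-base-uncased")
nsp = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")
inputs = tok("He bought a new laptop.", "It arrived two days later.",
             return_tensors="pt")
with torch.no_grad():
    logits = nsp(**inputs).logits
# Index 0 = "B is the next sentence", index 1 = "B is not the next sentence".
print(torch.softmax(logits, dim=-1))
```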
(B) Fine-tuning Phase
     BERT is fine-tuned on labeled data specific to the downstream task
of interest. The fine-tuned model is built on top of the pre-trained model,
which is already good at understanding text. By adding a dropout layer to keep overfitting in check
and a linear layer for the specific task, the new fine-tuned model takes in
input IDs and attention masks, runs them through the pre-trained model
and the added layers, and can then perform the downstream task.
This means BERT often requires extensive task-specific
training to fine-tune for specific applications.
Classification tasks such as sentiment analysis are done similarly to
Next Sentence Prediction, by adding a classification layer
on top of the Transformer output (a sketch of this setup follows the list below).
The types of sentence classification are:
*Intent Detection
*Language Detection
*Sentiment Analysis
*Spam Detection
*Topic Labeling
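A minimal sketch of the fine-tuning setup described above, assuming PyTorch and Hugging Face transformers: a pre-trained BERT body with a dropout layer and a task-specific linear layer on top of the [CLS] output. The checkpoint name, label count, and example input are illustrative:

```python
# Pre-trained BERT body + dropout + task-specific linear classification head.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class BertClassifier(nn.Module):
    def __init__(self, model_name="bert-base-uncased", num_labels=2, dropout=0.1):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        self.dropout = nn.Dropout(dropout)      # keeps overfitting in check
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = outputs.last_hidden_state[:, 0]   # [CLS] token representation
        return self.classifier(self.dropout(cls))  # task-specific logits

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertClassifier(num_labels=2)            # e.g., spam vs. not spam
batch = tokenizer(["free prize, click now!"], return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"])
```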
Named Entity Recognition (NER)
     The model receives a text sequence
and is required to mark the various types of entities
(Person, Organization, Date, etc.) that appear in the text.
Using BERT, an NER model can be trained by feeding the output vector of each token
into a classification layer that predicts the NER label, using labeled entity data for supervision.
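A minimal sketch of running such a model, assuming the Hugging Face transformers pipeline API; the checkpoint name is an illustrative choice of a BERT model already fine-tuned for NER:

```python
# Token classification with a BERT checkpoint fine-tuned for NER.
from transformers import pipeline

ner = pipeline("token-classification",
               model="dslim/bert-base-NER",       # illustrative checkpoint
               aggregation_strategy="simple")     # merge word pieces into entities
# Each prediction comes from the classification layer applied to every token.
print(ner("Satya Nadella became CEO of Microsoft in 2014."))
```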
Natural Language Inference
Natural language inference, also known as Recognizing Textual Entailment,
assesses whether one sentence (the hypothesis) can be inferred from another (the premise),
even when the two use different vocabulary, syntactic structure, etc.
The fine-tuned model takes in a pair of premise and hypothesis texts
and outputs one of three classes (entailment, contradiction, or neutral).
For example, given the premise
"A dog jumping for a Frisbee in the snow."
and the hypothesis
"A dog is outside in the snow, playing with a plastic toy."
a natural language inference model would output "entailment".
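A minimal sketch of scoring such a premise/hypothesis pair, assuming Hugging Face transformers and an encoder-only checkpoint already fine-tuned on MNLI; the checkpoint name is an assumption, and the label names come from that checkpoint's config:

```python
# Scoring a premise/hypothesis pair with an NLI-fine-tuned encoder model.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "roberta-large-mnli"                     # illustrative NLI checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

premise = "A dog jumping for a Frisbee in the snow."
hypothesis = "A dog is outside in the snow, playing with a plastic toy."
inputs = tok(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)
# Map class probabilities to the checkpoint's label names.
print({model.config.id2label[i]: p for i, p in enumerate(probs[0].tolist())})
```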
Question Answering
     The model receives a question regarding
a text sequence and is required to mark the answer within the sequence.
The Q&A model is provided with questions and corresponding passages,
and it learns to predict the start and end positions of the answer within
the passage. This is accomplished by learning two extra vectors that
mark the beginning and the end of the answer.
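A minimal sketch of this start/end span prediction, assuming Hugging Face transformers and a BERT checkpoint already fine-tuned on SQuAD-style data; the checkpoint name and example are illustrative:

```python
# Extractive QA: the model scores start and end positions of the answer span.
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

name = "deepset/bert-base-cased-squad2"          # illustrative checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForQuestionAnswering.from_pretrained(name)

question = "Who wrote the play Hamlet?"
passage = "Hamlet is a tragedy written by William Shakespeare around 1600."
inputs = tok(question, passage, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)
# Simplified: pick the highest-scoring start and end positions independently.
start = int(out.start_logits.argmax())
end = int(out.end_logits.argmax())
print(tok.decode(inputs["input_ids"][0][start:end + 1]))
```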
Decoder-only Architecture
     OpenAI’s GPT (Generative Pre-trained Transformer) models
adopt a decoder-only architecture. They are trained on massive amounts
of text data to generate human-like responses to a given input,
and they are pre-trained on the Next Token Prediction
task.
     The decoder-only architecture uses a causal decoder
trained with an autoregressive language modeling objective,
unlike the bidirectional encoder, which is trained with a masked language modeling objective.
Causal language modeling is pre-trained on Next Token Prediction tasks.
In the training process, the model predicts the next token in a sequence
of tokens, and the model can only attend to tokens on the left.
This means the model cannot see future tokens.
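A minimal sketch of the causal attention mask in PyTorch: position i may only attend to positions up to i, so future tokens are hidden during training.

```python
# Lower-triangular (causal) attention mask for a sequence of length 5.
import torch

seq_len = 5
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
# Row i is True only for columns <= i: each token attends only to the left.
print(causal_mask)
```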
     This architecture exhibits the strongest zero-shot generalization
after purely self-supervised pretraining. Zero-shot learning is a technique
in which a machine learning model can recognize and classify new concepts
without any labeled examples.
The decoder-only model still tokenizes the input text sequence,
and its multi-head attention mechanism is similar to that of the encoder-decoder Transformer.
However, the training objective is Next Token Prediction.
Since the model is autoregressive, it predicts one token at a time.
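A minimal sketch of this one-token-at-a-time decoding, assuming Hugging Face transformers and the GPT-2 checkpoint; greedy decoding is used here for simplicity:

```python
# Autoregressive decoding: predict one next token, append it, repeat.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("The decoder-only architecture", return_tensors="pt").input_ids
for _ in range(10):                               # generate 10 new tokens greedily
    with torch.no_grad():
        logits = model(ids).logits
    next_id = logits[0, -1].argmax()              # most likely next token
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
print(tok.decode(ids[0]))
```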
After the model is pre-trained, it goes through the following fine-tuning steps:
Supervised fine-tuning
A pre-trained GPT model can be fine-tuned using custom datasets
composed of input (prompt) and output (completion) pairs.
After this supervised learning, the trained model can respond according to
the input prompt.
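A minimal sketch of such supervised fine-tuning, assuming Hugging Face transformers and datasets; the tiny in-memory prompt/completion dataset and the hyperparameters are purely illustrative:

```python
# Supervised fine-tuning on (prompt, completion) pairs as a causal-LM task.
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForCausalLM,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

pairs = [{"prompt": "Summarize: the meeting was moved to Friday.",
          "completion": " The meeting is now on Friday."}]
ds = Dataset.from_list(pairs)

def to_features(ex):
    # The model learns to continue the prompt with the desired completion.
    return tok(ex["prompt"] + ex["completion"] + tok.eos_token, truncation=True)

ds = ds.map(to_features, remove_columns=["prompt", "completion"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-sft", num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
```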
Reward Model
In order to train a language model to have positive sentiment,
pairs of a statement and its sentiment ranking are used to fine-tune
the pre-trained GPT model.
It would be costly to ask human labelers to create all of the sentiment rankings.
Therefore, a language model is trained to generate the sentiment ranking.
Initially, human labelers rank the responses from the pre-trained GPT model.
This process creates a training set composed of pairs of a statement and its sentiment ranking (the reward).
The next step is to use this training set to train a model
so that the model itself can produce the statement-versus-sentiment-ranking pairs.
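A minimal sketch of a reward model of this kind, assuming Hugging Face transformers and PyTorch: a GPT-2 body with a one-dimensional head regresses the sentiment ranking (reward) of a statement. The training pairs and the regression loss are illustrative stand-ins for human-labeled rankings and for the pairwise ranking losses often used in practice:

```python
# Reward model sketch: GPT-2 body + scalar head that scores a statement.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token
reward_model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=1)
reward_model.config.pad_token_id = tok.pad_token_id

# (statement, human-assigned sentiment ranking) pairs -- illustrative data.
pairs = [("What a wonderful day!", 1.0), ("This is terrible.", 0.0)]
opt = torch.optim.AdamW(reward_model.parameters(), lr=1e-5)
for statement, reward in pairs:
    inputs = tok(statement, return_tensors="pt")
    score = reward_model(**inputs).logits.squeeze()   # scalar reward estimate
    loss = torch.nn.functional.mse_loss(score, torch.tensor(reward))
    loss.backward()
    opt.step()
    opt.zero_grad()
```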
Fine-tune using statement and reward
a) Use the pre-trained model to generate statements according to input prompts.
b) Use the statements from step (a) to generate sentiment rankings, which are the rewards.
c) Use the training dataset of (statement, reward) pairs to optimize
the parameters of the fine-tuned model so that its responses are tailored toward
positive sentiment (high rewards).
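A minimal sketch of steps (a) through (c), assuming Hugging Face transformers and PyTorch. A pretrained sentiment classifier stands in for the reward model, and a simplified REINFORCE-style update stands in for the PPO algorithm typically used in practice; it is only meant to make the three steps concrete:

```python
# Steps (a)-(c) as a simplified policy-gradient update on GPT-2.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

tok = AutoTokenizer.from_pretrained("gpt2")
policy = AutoModelForCausalLM.from_pretrained("gpt2")
opt = torch.optim.AdamW(policy.parameters(), lr=1e-6)

# A pretrained sentiment classifier stands in for the reward model above.
sentiment = pipeline("sentiment-analysis")

prompt_ids = tok("The movie was", return_tensors="pt").input_ids

# (a) generate a statement from the input prompt
gen_ids = policy.generate(prompt_ids, max_new_tokens=12, do_sample=True,
                          pad_token_id=tok.eos_token_id)
statement = tok.decode(gen_ids[0], skip_special_tokens=True)

# (b) score the statement: reward = probability that its sentiment is positive
result = sentiment(statement)[0]
reward = result["score"] if result["label"] == "POSITIVE" else 1.0 - result["score"]

# (c) REINFORCE-style update: raise the log-probability of the generated tokens
#     in proportion to the reward, steering the policy toward positive sentiment
logits = policy(gen_ids).logits[:, :-1]
logprobs = torch.log_softmax(logits, dim=-1)
token_logprobs = logprobs.gather(-1, gen_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
loss = -reward * token_logprobs.sum()
loss.backward()
opt.step()
opt.zero_grad()
```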