What does AI do with language?
     AI systems exploit the statistical relationships among the words of a language
and encode those words as sets of mathematical vectors.
These vectors are fed into the model during training,
and the model output is decoded into another expression that serves
a specific purpose of communication, such as classifying the input content,
analyzing its sentiment, summarizing a long document,
translating between languages, assessing grammatical correctness,
or providing knowledge-based answers to questions.
By integrating a language model with an image model,
AI can also generate images from a given text prompt or
generate a caption for a given image.
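A minimal sketch of this encode-and-decode idea in Python (the vocabulary, embedding values, and classifier weights below are made-up toy numbers, not a real model):

    import numpy as np

    # Toy vocabulary and 4-dimensional word embeddings (made-up values).
    vocab = {"great": 0, "movie": 1, "terrible": 2, "plot": 3}
    embeddings = np.array([
        [ 0.9, 0.1,  0.3,  0.0],   # great
        [ 0.1, 0.2,  0.1,  0.1],   # movie
        [-0.8, 0.0, -0.4,  0.1],   # terrible
        [ 0.0, 0.3,  0.2, -0.1],   # plot
    ])

    # Encode: map each word to its vector and average into one sentence vector.
    def encode(sentence):
        ids = [vocab[w] for w in sentence.lower().split() if w in vocab]
        return embeddings[ids].mean(axis=0)

    # "Decode" the model output into a purpose: here, a toy sentiment label.
    w, b = np.array([1.5, 0.0, 1.0, 0.0]), 0.0
    def sentiment(sentence):
        score = encode(sentence) @ w + b
        return "positive" if score > 0 else "negative"

    print(sentiment("great movie"))    # positive
    print(sentiment("terrible plot"))  # negative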
What is a Large Language Model (LLM)?
     Large Language Models (LLMs) are large-scale deep learning models with
hundreds of billions of parameters, pre-trained on massive amounts of data.
     Their architecture is based on the Transformer,
which transforms an input sequence into an output sequence.
The underlying neural network learns context and assigns scores to text segments
in order to capture the intricate relationships between sequence components.
     Transformer LLMs are capable of self-supervised learning.
Self-supervision derives implicit but intrinsic correlations and patterns, i.e.,
“pseudo-labels”, from unlabeled data to drive deep learning.
For example, parts of a given sentence are randomly hidden (or “masked”),
and the model is trained to predict the masked words,
using the original (unlabeled) sentence as ground truth.
As in the training of a traditional neural network,
a loss function, gradient descent, and the backpropagation algorithm
are used to optimize the model.
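A minimal sketch of the masked-word objective described above (the whitespace tokenizer, 15% mask rate, and “[MASK]” symbol are illustrative assumptions, not a specific model's recipe):

    import random

    MASK, MASK_RATE = "[MASK]", 0.15

    def make_mlm_example(sentence):
        """Randomly mask tokens; the original tokens become the training targets."""
        inputs, targets = [], []
        for tok in sentence.split():
            if random.random() < MASK_RATE:
                inputs.append(MASK)     # hidden from the model
                targets.append(tok)     # the model must predict this token
            else:
                inputs.append(tok)
                targets.append(None)    # no loss on unmasked positions
        return inputs, targets

    # Masking is random, so the output differs from run to run.
    inputs, targets = make_mlm_example("the school has a library because it provides education")
    print(inputs)
    print(targets)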
     Transformers can also process every token of an entire text sequence
in parallel, which significantly reduces training time.
In the pretext phase, i.e., “representation learning,”
the model learns meaningful representations of unstructured data.
Those learned representations are then fine-tuned for specific
downstream tasks; this fine-tuning often involves true supervised learning.
The reuse of a pre-trained model on a new task is referred to as
“transfer learning.”
Such self-supervised learning is used in the training of a diverse array
of sophisticated deep learning architectures for a variety of tasks,
from transformer-based large language models (LLMs) like BERT and GPT
to image synthesis models like variational autoencoders (VAEs).
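A minimal sketch of the fine-tuning and transfer-learning idea described above, assuming a frozen pre-trained encoder (simulated here by a fixed random projection) and a small labeled downstream dataset; only the new task head is trained:

    import numpy as np

    rng = np.random.default_rng(0)

    # Stand-in for a frozen pre-trained encoder: a fixed projection from
    # 8-dimensional inputs to 4-dimensional learned representations.
    W_pretrained = rng.normal(size=(8, 4))
    def encode(x):                       # frozen: never updated below
        return np.tanh(x @ W_pretrained)

    # Tiny labeled downstream dataset (true supervised fine-tuning).
    X = rng.normal(size=(32, 8))
    y = (X[:, 0] > 0).astype(float)      # toy binary labels

    # Train only a new task-specific head with gradient descent on a logistic loss.
    w_head = np.zeros(4)
    for _ in range(200):
        h = encode(X)
        p = 1.0 / (1.0 + np.exp(-(h @ w_head)))
        grad = h.T @ (p - y) / len(y)    # gradient of the loss w.r.t. w_head
        w_head -= 0.5 * grad

    print("training accuracy:", ((p > 0.5) == y).mean())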
Transformer
     The original transformer architecture utilized both an encoder and a decoder,
as illustrated in the figure below.
Encoder
     The encoder’s role is to take in the input sequence
(such as a sentence in English) and
use a tokenizer to break the text into words or characters,
which are called tokens. The tokens are combined with positional encodings
to generate the embedding vectors (h) of the input text sequence,
in which words or phrases from the vocabulary are mapped to
vectors of real numbers.
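A minimal sketch of this step, assuming a whitespace tokenizer, a toy vocabulary built on the fly, random placeholder embeddings (learned in a real model), and the sinusoidal positional encoding of the original Transformer:

    import numpy as np

    d_model = 8
    sentence = "the school has a library"

    # Tokenize: break the text into tokens and map each token to an integer id.
    tokens = sentence.split()
    vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
    ids = np.array([vocab[t] for t in tokens])

    # Token embeddings: random placeholders standing in for learned vectors.
    rng = np.random.default_rng(0)
    embed_table = rng.normal(size=(len(vocab), d_model))
    token_emb = embed_table[ids]                      # shape (seq_len, d_model)

    # Sinusoidal positional encoding, as in the original Transformer.
    pos = np.arange(len(tokens))[:, None]             # positions 0 .. seq_len-1
    dim = np.arange(d_model)[None, :]
    angle = pos / np.power(10000, (2 * (dim // 2)) / d_model)
    pos_enc = np.where(dim % 2 == 0, np.sin(angle), np.cos(angle))

    # Embedding vectors h of the input sequence: token embedding + position.
    h = token_emb + pos_enc
    print(h.shape)                                    # (5, 8)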
Attention
     Instead of encoding the input sequence into
a single fixed vector, the attention mechanism develops
a context vector that relates words in different positions of
an input sequence.
     When the embedding vector enters the attention mechanism,
it is multiplied by three weight matrices to generate
three vectors: query, key, and value. The weight matrices are initialized
randomly and are optimized during training.
“An attention function [maps] a query and a set of key-value pairs
to an output, where the query, keys, values, and output are all vectors.
The output is computed as a weighted sum of the values,
where the weight assigned to each value is computed by
a compatibility function of the query with the corresponding key.”
Source: “Attention Is All You Need” (Vaswani et al., 2017).
Query (Q) is the representation of a word (token) in the input sequence.
Key (K) is the representation of a token against which the query’s compatibility is checked.
The keys are used for calculating the attention distribution;
for self-attention, the keys come from the same set of input tokens.
Value (V) is the actual representation vector associated with each key.
The values are used for encoding the context representation of the text sequence.
The attention score is calculated as

    Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V

where d_k is the dimension of the query (and key) vectors.
In the equation above,
the score matrix (e) is formed by
the dot product of the query and key matrices (QK^T), scaled by sqrt(d_k).
The score matrix is then
normalized into weightings (w) using a softmax function.
Multiplying the weightings (w) with the value matrix (V)
yields the context vector, a weighted sum of the value vectors
using the normalized probabilities as weights.
In this way, each token is contextualized: the attention scores amplify
the signal from important tokens and diminish that of less important ones.
The attention score indicates
the strength of the relationship between a token in the sequence
and every other token.
For example:
“The school has a library because it provides education.”
Here the word “it” has higher scores relating to “school” and “education”.
“The school has a library because it inventories books.”
Here the word “it” has higher scores relating to “library” and “books”.
Query=”it”
Key=”The”, “school”, “has”, “a”, “library”, “because”, “it”, “provides”, “education”
Value=”The”, “school”, “has”, “a”, “library”, “because”, “it”, “provides”, “education”
Context Vector (the element for “it”) =
0.01*”The”+0.20*”school”+0.02*”has”+0.01*”a”+0.15*”library”+0.05*”because”+0.21*”it”+0.15*”provides”+0.20*”education”
where the numbers are the softmax-normalized probabilities of the query-key dot products,
i.e., the attention probabilities between tokens,
and the words in the context vector stand for the elements of the Value vector.
         In the actual calculation, each element of the Value vector
is a learned numerical vector rather than a word.
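A minimal sketch of a single attention head over toy embeddings (the dimensions, inputs, and random weight matrices W_q, W_k, W_v are illustrative; in a real model they are learned during training):

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)       # for numerical stability
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def self_attention(h, W_q, W_k, W_v):
        """Scaled dot-product self-attention for one head."""
        Q, K, V = h @ W_q, h @ W_k, h @ W_v           # project embeddings into Q, K, V
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)               # e = QK^T / sqrt(d_k)
        weights = softmax(scores, axis=-1)            # w: each row sums to 1
        return weights @ V, weights                   # context vectors and attention map

    seq_len, d_model, d_k = 9, 8, 4                   # e.g. the 9 tokens of the example sentence
    rng = np.random.default_rng(0)
    h = rng.normal(size=(seq_len, d_model))           # toy input embeddings
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

    context, weights = self_attention(h, W_q, W_k, W_v)
    print(context.shape)          # (9, 4): one context vector per token
    print(weights[6].round(2))    # attention of the 7th token ("it") over all 9 tokens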
Multi-Head Attention Mechanism
         Instead of performing a single attention function over the full representation,
the multi-head attention mechanism splits its Query, Key, and Value
parameters multiple ways and passes each split independently through
a separate head. All of these parallel attention calculations are then
combined to produce the final attention output.
This enables the encoding of multiple relationships and nuances for
each word and also allows the computation to be parallelized.
Consequently, it captures both the semantic and the syntactic properties of
the input data. A key aspect that differentiates Transformers
from traditional neural networks is this attention mechanism.
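A minimal sketch of the split-and-combine idea, reusing the self_attention function and variables from the sketch above (the number of heads, dimensions, and random weights are illustrative assumptions):

    def multi_head_attention(h, heads, W_o):
        """heads: a list of (W_q, W_k, W_v) tuples, one per head."""
        outputs = [self_attention(h, W_q, W_k, W_v)[0] for W_q, W_k, W_v in heads]
        concat = np.concatenate(outputs, axis=-1)     # combine the heads' outputs
        return concat @ W_o                           # final linear projection

    n_heads, d_head = 2, d_model // 2                 # split d_model = 8 into 2 heads of 4
    heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
             for _ in range(n_heads)]
    W_o = rng.normal(size=(n_heads * d_head, d_model))

    out = multi_head_attention(h, heads, W_o)
    print(out.shape)                                  # (9, 8): back to the model dimension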
Decoder
         The decoder works similarly to the encoder,
but it applies the self-attention mechanism to the target sequence
(such as Spanish) rather than the input sequence (such as English).
After the decoder produces context vectors for the target sequence,
it takes the encoded context vectors from the encoder and
uses the multi-head attention mechanism to compute the interaction
between each target token and each input token.
Finally, the entire transformer is trained to generate the output
(such as a translated sentence in Spanish).
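A minimal sketch of this encoder-decoder (cross-) attention step, again reusing the attention sketch above; the target-side states and the extra weight matrices are illustrative assumptions:

    # Cross-attention: queries come from the decoder (target) states,
    # while keys and values come from the encoder (source) output.
    tgt_len = 7
    h_tgt = rng.normal(size=(tgt_len, d_model))       # toy decoder states (target sequence)
    h_src = h                                         # encoder output from the sketches above

    Wq_x, Wk_x, Wv_x = (rng.normal(size=(d_model, d_k)) for _ in range(3))
    Q = h_tgt @ Wq_x
    K, V = h_src @ Wk_x, h_src @ Wv_x
    weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
    cross_context = weights @ V
    print(cross_context.shape)                        # (7, 4): one context vector per target token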
Transformers have the advantage of having no recurrent units,
and thus require less training time than previous recurrent neural architectures,
such as long short-term memory (LSTM). They can be trained on
large language datasets, such as the Wikipedia corpus and Common Crawl.
Difference between Autoencoder and Encoder
         An autoencoder is a special type of neural network that
encodes input data into a compact latent representation and
then decodes it back into a reconstructed output.
It is therefore not simply a transformer encoder.
For example, given an image of a handwritten digit, an autoencoder
first encodes the image into a lower dimensional latent representation,
then decodes the latent representation back to an image.
An autoencoder learns to compress the data while minimizing
the reconstruction error, which makes it suitable for applications like data compression,
dimensionality reduction, and generative modeling.
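A minimal sketch of a linear autoencoder trained to minimize the reconstruction error (the data is a made-up low-rank toy set, and the learning rate and sizes are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy data: 200 samples of 64-dimensional inputs that actually lie in an
    # 8-dimensional subspace, so a compact latent code can reconstruct them.
    X = rng.normal(size=(200, 8)) @ rng.normal(size=(8, 64)) / np.sqrt(8)

    d_latent = 8
    W_enc = 0.1 * rng.normal(size=(64, d_latent))     # encoder weights
    W_dec = 0.1 * rng.normal(size=(d_latent, 64))     # decoder weights
    lr = 0.01

    print("initial MSE:", ((X @ W_enc @ W_dec - X) ** 2).mean())

    # Gradient descent on the mean squared reconstruction error.
    for _ in range(1000):
        Z = X @ W_enc                                 # encode: compact latent representation
        X_hat = Z @ W_dec                             # decode: reconstructed output
        err = X_hat - X
        grad_dec = Z.T @ err / len(X)
        grad_enc = X.T @ (err @ W_dec.T) / len(X)
        W_dec -= lr * grad_dec
        W_enc -= lr * grad_enc

    # The reconstruction error should be much lower after training.
    print("final MSE:", ((X @ W_enc @ W_dec - X) ** 2).mean())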