Prerequisites:

Attention

    Introduction

  • Ever since the paper "Attention Is All You Need" was published in 2017, the Transformer has become the dominant architecture in many AI models.

  • The main improvement transformers made over other architectures like RNNs and LSTMs is the use of the attention mechanism.

  • Attention measures how strongly each word in a sentence relates to the words around it, i.e. its context. For example, consider this English sentence and its Spanish translation:

    • "I left your red book at home."
    • "Dejé tu libro rojo en casa."
  • If you translated this sentence word for word, you would run into some issues. First, "dejé" corresponds to two words, "I left." Second, in Spanish "libro rojo" directly translates to "book red," since adjectives come after the noun they modify.

  • One solution to these issues is attention: having each output word focus on the key words in the input sentence. For example, "dejé" would pay particular attention to "I left," and "rojo" to "red."

  • A diagram showing attention for the English-to-Spanish translation
  • The mechanism of attention is a really powerful concept; it turns out that attention can even be applied within the source sentence itself to understand the "meaning" of its words.

  • A diagram showing attention for the English sentence in isolation
  • If you pay attention (no pun intended) to the above image, you'll notice the model starting to pick up on patterns in the language. For example, it correlates the word "left" with "at" because it knows the phrase "left ... at" is fairly common.

    Softmax

  • When calculating attention, we often want to focus on the most prominent results. In machine learning we do this with what's called a softmax function.

  • One way softmax can be calculated is $\frac{e^{x_i}}{\sum_{j=1}^n e^{x_j}}$ for every element $x_i$ in the data.

  • Here's a visual of the effect it has on data:

  • A diagram showing how softmax isolates the most prominent
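
  • As a rough illustration, here's a minimal NumPy sketch of the softmax formula above; the example scores are made up purely for illustration:

```python
import numpy as np

def softmax(x):
    """Softmax as defined above: e^{x_i} / sum_j e^{x_j}.

    Subtracting max(x) first is a common numerical-stability trick;
    it does not change the result.
    """
    exps = np.exp(x - np.max(x))
    return exps / exps.sum()

scores = np.array([1.0, 2.0, 5.0, 0.5])
print(softmax(scores))
# -> roughly [0.017, 0.046, 0.927, 0.010]
# The largest score takes almost all of the probability mass, which is the
# "focus on the most prominent results" behavior described above.
```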

    Scaled Dot-Product Attention

  • Scaled Dot-Product Attention is the basic building block of the attention mechanism.

  • When calculating attention, the input is projected into 3 parts: the query, the key, and the value

  • You can think of each of these as analogous to their information retrieval counterparts:

    • The query is the current word we wish to find. For example, when translating from English to Spanish, we take the keys and values from the English sentence and the query from the Spanish word being generated.
    • The key is the information we have; it acts like an index that the query is matched against.
    • The value is the information itself, which is returned according to how well its key matches the query.
  • The image below is the Scaled Dot-Product Attention diagram from the original Attention Is All You Need paper.

  • the diagram from the original Attention is All You Need Paper that shows Scaled Dot-Product Attention
  • We start by matrix multiplying the query and the key ($\mathbf{Q}\mathbf{K}^T$)

  • Then, we scale the result by the square root of the dimension of the key ($\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{k}}$). This is the "scaled" part of the Scaled Dot-Product Attention and ensures the dot product is not too large for softmax.

  • Then, we sometimes apply a mask so that, when predicting the next word during training, the model can't cheat by looking at future words; attention is applied only to the words before the current position.

  • Finally, we take a softmax so that only the most relevant words receive significant weight, and multiply the result by the values.

  • Overall, this combines into the formula from the paper: $\text{Attention}(\mathbf{Q},\mathbf{K},\mathbf{V})=\text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{k}}\right)\mathbf{V}$
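
  • To make the steps above concrete, here's a minimal NumPy sketch of Scaled Dot-Product Attention; the shapes, the boolean mask convention, and the toy inputs are assumptions for illustration, not the paper's exact implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    exps = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return exps / exps.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(k)) V.

    Q: (n_queries, k), K: (n_keys, k), V: (n_keys, d_v).
    mask: optional boolean (n_queries, n_keys) array; False entries
    (e.g. future words) are hidden from the softmax.
    """
    k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(k)               # MatMul, then scale
    if mask is not None:
        scores = np.where(mask, scores, -1e9)   # masked positions get ~0 weight
    weights = softmax(scores, axis=-1)          # focus on the most relevant words
    return weights @ V                          # combine with the values

# Toy example: 3 query words attending over 4 key/value words, dimension 8.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 8)
```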

    Multi-Head Attention

  • You can think of Multi-Head Attention as a set of individually learned weight matrices that each apply a linear transformation to the input (e.g. $\mathbf{W}^K\mathbf{X}$ for the key matrix)

  • Each head's projected input (e.g. $\mathbf{W}^K_i\mathbf{X}$ for head $i$'s key) is then fed through scaled dot-product attention in parallel

    • This is so that the attention mechanism can take advantage of GPU architecture where parallel computing is more efficient than sequential computing
  • Finally, all of the scaled dot-product attention outputs are concatenated (combined) into one big matrix, which is then multiplied by its own separately learned weight matrix $\mathbf{W}^O$

  • Here is a diagram of the overall process from Attention Is All You Need:

  • the diagram from the original Attention Is All You Need Paper that shows Multi-Head Attention
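
  • Here's a minimal sketch of this process, reusing numpy as np and the scaled_dot_product_attention function from the earlier sketch; the per-head weight lists and toy dimensions are assumptions for illustration (a real implementation also batches the heads so they actually run in parallel on the GPU, rather than looping):

```python
def multi_head_attention(X_q, X_kv, W_Q, W_K, W_V, W_O):
    """Multi-head attention sketch: per-head linear projections,
    scaled dot-product attention per head, then concatenation and W_O.

    X_q provides the queries, X_kv the keys and values; for
    self-attention both are the same matrix.
    """
    heads = []
    for Wq, Wk, Wv in zip(W_Q, W_K, W_V):                  # one iteration per head
        Q, K, V = X_q @ Wq, X_kv @ Wk, X_kv @ Wv           # learned linear transformations
        heads.append(scaled_dot_product_attention(Q, K, V))
    return np.concatenate(heads, axis=-1) @ W_O            # concat, then the output weight W^O

# Toy sizes: model dimension 16, 4 heads of dimension 4, sequence length 5.
rng = np.random.default_rng(1)
d_model, n_heads, d_head = 16, 4, 4
X = rng.normal(size=(5, d_model))
W_Q = [rng.normal(size=(d_model, d_head)) for _ in range(n_heads)]
W_K = [rng.normal(size=(d_model, d_head)) for _ in range(n_heads)]
W_V = [rng.normal(size=(d_model, d_head)) for _ in range(n_heads)]
W_O = rng.normal(size=(n_heads * d_head, d_model))
print(multi_head_attention(X, X, W_Q, W_K, W_V, W_O).shape)  # (5, 16)
```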

    Self-Attention and Cross-Attention

  • In a transformer model, there exist two types of attention mechanisms:

  • In self-attention, the key, query, and value all come from the same sequence.

  • In cross-attention, on the other hand, they come from different sources.

  • Both the encoder and decoder have self-attention in order to understand the context and meaning of their input sequence.

  • The decoder also has a cross-attention layer in order to translate from one format to another, such as from one language to another.

  • Here's an image explaining this distinction. Notice where the keys, queries, and values come from in each Multi-Head Attention block (source: https://towardsdatascience.com/attention-please-85bd0abac41):

  • image showing flow of key query and value throughout the transformer model
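
  • In code terms, the distinction is just where the query and key/value inputs come from. Continuing the multi-head sketch above (the encoder/decoder states here are random stand-ins, purely for illustration):

```python
# Continues the multi-head sketch above (reuses rng, d_model, the weight
# lists, and multi_head_attention). The states below are made-up stand-ins
# for real encoder/decoder hidden states.
encoder_states = rng.normal(size=(7, d_model))   # e.g. the English sentence
decoder_states = rng.normal(size=(5, d_model))   # e.g. the Spanish sentence so far

# Self-attention: queries, keys, and values all come from the same sequence.
enc_self = multi_head_attention(encoder_states, encoder_states, W_Q, W_K, W_V, W_O)

# Cross-attention: queries come from the decoder, keys and values from the encoder.
dec_cross = multi_head_attention(decoder_states, encoder_states, W_Q, W_K, W_V, W_O)

print(enc_self.shape, dec_cross.shape)  # (7, 16) (5, 16)
```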

Visualization:

Here's a visualization of attention for our example (using BertViz: https://github.com/jessevig/bertviz, link to Github Gist: https://gist.github.com/notrandomath/4638812d141d3c9adf4f104a167d478f)
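
Below is a minimal sketch of how such a visualization can be produced with BertViz in a Jupyter notebook; the linked gist is the authoritative version, and the choice of bert-base-uncased here is an assumption:

```python
# Minimal BertViz sketch (run inside a Jupyter notebook); the model choice
# is an assumption, see the linked gist for the actual version.
from transformers import AutoModel, AutoTokenizer
from bertviz import head_view

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("I left your red book at home.", return_tensors="pt")
outputs = model(**inputs)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
head_view(outputs.attentions, tokens)  # interactive per-head attention view
```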

    Additional Materials

  • https://www.comet.com/site/blog/explainable-ai-for-transformers/

  • https://stats.stackexchange.com/a/424127

  • https://www.youtube.com/watch?v=kCc8FmEb1nY&t=5047s