Prerequisites:

Attention

    Introduction

  • Ever since the paper "Attention Is All You Need" was published in 2017, the Transformer has become the dominant architecture in many AI models.

  • The main improvement transformers made over other architectures like RNNs and LSTMs is the use of the attention mechanism.

  • Attention measures how strongly each word in a sentence relates to the words around it, i.e. its context. For example, consider this English sentence and its Spanish translation:

    • "I left your red book at home."
    • "Dejé tu libro rojo en casa."
  • If you translated this sentence word for word, you would run into some issues. First, "dejé" corresponds to two words, "I left." Second, in Spanish "libro rojo" directly translates to "book red," since adjectives come after the noun they modify.

  • One solution to these issues is attention: having each output word focus on the key words in the input sentence. For example, "dejé" would pay particular attention to "I left," and "rojo" to "red."

  • A diagram showing attention for the English-to-Spanish translation
  • The mechanism of attention is a really powerful concept; it turns out that attention can even be applied within the source sentence itself to understand the "meaning" of its words.

  • A diagram showing attention for the English sentence in isolation
  • If you pay attention (no pun intended) to the above image, you'll notice the model starting to pick up on patterns in the language. For example, it correlates the word "left" with "at" because it knows the phrase "left ... at" is fairly common.

    Softmax

  • When calculating attention, we often want to focus on the most prominent results. In machine learning we do this with what's called a softmax function.

  • One way softmax can be calculated is $\frac{e^{x_i}}{\sum_{j=1}^n e^{x_j}}$ for every element $x_i$ in the data.

  • Here's a visual of the effect it has on data:

  • A diagram showing how softmax isolates the most prominent
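
  • As a rough illustration, here's a minimal NumPy sketch of the softmax formula above; the example scores are made up purely for illustration:

```python
import numpy as np

def softmax(x):
    """Softmax as defined above: e^{x_i} / sum_j e^{x_j}.

    Subtracting max(x) first is a common numerical-stability trick;
    it does not change the result.
    """
    exps = np.exp(x - np.max(x))
    return exps / exps.sum()

scores = np.array([1.0, 2.0, 5.0, 0.5])
print(softmax(scores))
# -> roughly [0.017, 0.046, 0.927, 0.010]
# The largest score takes almost all of the probability mass, which is the
# "focus on the most prominent results" behavior described above.
```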

    Scaled Dot-Product Attention

  • Scaled Dot-Product Attention is the basic building block of the attention mechanism.

  • When calculating attention, the input is projected into 3 parts: the query, the key, and the value

  • You can think of each of these as analogous to their information retrieval counterparts:

    • The query is the current word we wish to find. For example, when translating from English to Spanish, we take the keys and values from the English sentence and the query from the Spanish word being generated.
    • The key is the information we have; it acts like an index that the query is matched against.
    • The value is the information itself, which is returned according to how well its key matches the query.
  • The image below is the Scaled Dot-Product Attention diagram from the original Attention Is All You Need paper.

  • the diagram from the original Attention is All You Need Paper that shows Scaled Dot-Product Attention
  • We start by matrix multiplying the query and the key ($\mathbf{Q}\mathbf{K}^T$)

  • Then, we scale the result by the square root of the dimension of the key ($\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{k}}$). This is the "scaled" part of the Scaled Dot-Product Attention and ensures the dot product is not too large for softmax.

  • Then, we sometimes apply a mask so that, when predicting the next word during training, the model can't cheat by looking at future words; attention is applied only to the words before the current position.

  • Finally, we take a softmax so that only the most relevant words receive significant weight, and multiply the result by the values.

  • Overall, this combines into the formula from the paper: $\text{Attention}(\mathbf{Q},\mathbf{K},\mathbf{V})=\text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{k}}\right)\mathbf{V}$
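
  • To make the steps above concrete, here's a minimal NumPy sketch of Scaled Dot-Product Attention; the shapes, the boolean mask convention, and the toy inputs are assumptions for illustration, not the paper's exact implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    exps = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return exps / exps.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(k)) V.

    Q: (n_queries, k), K: (n_keys, k), V: (n_keys, d_v).
    mask: optional boolean (n_queries, n_keys) array; False entries
    (e.g. future words) are hidden from the softmax.
    """
    k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(k)               # MatMul, then scale
    if mask is not None:
        scores = np.where(mask, scores, -1e9)   # masked positions get ~0 weight
    weights = softmax(scores, axis=-1)          # focus on the most relevant words
    return weights @ V                          # combine with the values

# Toy example: 3 query words attending over 4 key/value words, dimension 8.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 8)
```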

    Multi-Head Attention

  • You can think of Multi-Head Attention as a set of individually learned weight matrices that each apply a linear transformation to the input (e.g. $\mathbf{W}^K\mathbf{X}$ for the key matrix)

  • Each head's projected input (e.g. $\mathbf{W}^K_i\mathbf{X}$ for head $i$'s key) is then fed through scaled dot-product attention in parallel

    • This is so that the attention mechanism can take advantage of GPU architecture where parallel computing is more efficient than sequential computing
  • Finally, all of the scaled dot-product attention outputs are concatenated (combined) into one big matrix, which is then multiplied by its own separately learned weight matrix $\mathbf{W}^O$

  • Here is a diagram of the overall process from Attention Is All You Need:

  • the diagram from the original Attention Is All You Need Paper that shows Multi-Head Attention
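
  • Here's a minimal sketch of this process, reusing numpy as np and the scaled_dot_product_attention function from the earlier sketch; the per-head weight lists and toy dimensions are assumptions for illustration (a real implementation also batches the heads so they actually run in parallel on the GPU, rather than looping):

```python
def multi_head_attention(X_q, X_kv, W_Q, W_K, W_V, W_O):
    """Multi-head attention sketch: per-head linear projections,
    scaled dot-product attention per head, then concatenation and W_O.

    X_q provides the queries, X_kv the keys and values; for
    self-attention both are the same matrix.
    """
    heads = []
    for Wq, Wk, Wv in zip(W_Q, W_K, W_V):                  # one iteration per head
        Q, K, V = X_q @ Wq, X_kv @ Wk, X_kv @ Wv           # learned linear transformations
        heads.append(scaled_dot_product_attention(Q, K, V))
    return np.concatenate(heads, axis=-1) @ W_O            # concat, then the output weight W^O

# Toy sizes: model dimension 16, 4 heads of dimension 4, sequence length 5.
rng = np.random.default_rng(1)
d_model, n_heads, d_head = 16, 4, 4
X = rng.normal(size=(5, d_model))
W_Q = [rng.normal(size=(d_model, d_head)) for _ in range(n_heads)]
W_K = [rng.normal(size=(d_model, d_head)) for _ in range(n_heads)]
W_V = [rng.normal(size=(d_model, d_head)) for _ in range(n_heads)]
W_O = rng.normal(size=(n_heads * d_head, d_model))
print(multi_head_attention(X, X, W_Q, W_K, W_V, W_O).shape)  # (5, 16)
```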

    Self-Attention and Cross-Attention

  • In a transformer model, there exist two types of attention mechanisms:

  • In self-attention, the key, query, and value all come from the same sequence.

  • In cross-attention, on the other hand, they come from different sources.

  • Both the encoder and decoder have self-attention in order to understand the context and meaning of their input sequence.

  • The decoder also has a cross-attention layer in order to translate from one format to another, such as from one language to another.

  • Here's an image explaining this distinction. Notice where the keys, queries, and values come from in each Multi-Head Attention block (source: https://towardsdatascience.com/attention-please-85bd0abac41):

  • image showing flow of key query and value throughout the transformer model
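
  • In code terms, the distinction is just where the query and key/value inputs come from. Continuing the multi-head sketch above (the encoder/decoder states here are random stand-ins, purely for illustration):

```python
# Continues the multi-head sketch above (reuses rng, d_model, the weight
# lists, and multi_head_attention). The states below are made-up stand-ins
# for real encoder/decoder hidden states.
encoder_states = rng.normal(size=(7, d_model))   # e.g. the English sentence
decoder_states = rng.normal(size=(5, d_model))   # e.g. the Spanish sentence so far

# Self-attention: queries, keys, and values all come from the same sequence.
enc_self = multi_head_attention(encoder_states, encoder_states, W_Q, W_K, W_V, W_O)

# Cross-attention: queries come from the decoder, keys and values from the encoder.
dec_cross = multi_head_attention(decoder_states, encoder_states, W_Q, W_K, W_V, W_O)

print(enc_self.shape, dec_cross.shape)  # (7, 16) (5, 16)
```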

Visualization:

Here's a visualization of attention for our example (using BertViz: https://github.com/jessevig/bertviz, link to Github Gist: https://gist.github.com/notrandomath/4638812d141d3c9adf4f104a167d478f)
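
Below is a minimal sketch of how such a visualization can be produced with BertViz in a Jupyter notebook; the linked gist is the authoritative version, and the choice of bert-base-uncased here is an assumption:

```python
# Minimal BertViz sketch (run inside a Jupyter notebook); the model choice
# is an assumption, see the linked gist for the actual version.
from transformers import AutoModel, AutoTokenizer
from bertviz import head_view

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("I left your red book at home.", return_tensors="pt")
outputs = model(**inputs)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
head_view(outputs.attentions, tokens)  # interactive per-head attention view
```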

    Additional Materials

  • https://www.comet.com/site/blog/explainable-ai-for-transformers/

  • https://stats.stackexchange.com/a/424127

  • https://www.youtube.com/watch?v=kCc8FmEb1nY&t=5047s