
Transformers (Quick Revision)

3 min read · Jul 22, 2022

Self-Attention and Transformers Revision

Source: Google

Welcome back! This blog is part of the Data Science/Machine Learning series. Today we are going to discuss Self-Attention and Transformers in Deep Learning. If you have missed the previous blogs, you can catch up on them here.

The Intuition

The Transformer is a groundbreaking piece of research that changed the trajectory of Deep Learning. This is a revision series, so I assume you have learnt about Transformers previously. Getting into the basics, let's start with the basic block diagram.

Transformer Block

A Transformer block consists of two sub-blocks:

  • Self-Attention Block
  • Feed Forward Neural Network Block

Advanced architectures like BERT and GPT have the transformer as their building block, so it is important to know the underlying math and concepts behind transformers. Without further ado, let's dive in. Transformers have a great advantage because they try to understand the context of a sentence. This mainly helps to solve problems such as machine translation, question-answering chatbots and many more.

Self-Attention

Consider the sentence "There was heavy rain, so the dog couldn't cross the river because it was flooded with water". Self-attention lets the model associate "it" with the right word (here, river). To compute the self-attention of a sentence we need to follow a few steps. The inputs are the embedding vectors of the tokenized words.

In matrix form, we can represent this as

Matrix Form for the Self Attention Calculation
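As a concrete illustration, here is a minimal NumPy sketch of the matrix form, i.e. scaled dot-product self-attention softmax(Q·Kᵀ / √d_k)·V. The projection matrices and the toy dimensions are made up for the example, not taken from any particular model.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention.
    X: (seq_len, d_model) token embeddings.
    W_q, W_k, W_v: projection matrices of shape (d_model, d_k)."""
    Q = X @ W_q            # queries
    K = X @ W_k            # keys
    V = X @ W_v            # values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len) attention scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted sum of the values

# Toy example: 4 tokens, embedding size 8, head size 4 (arbitrary numbers)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)  # (4, 4)
```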

The Multi-head attention

Multi-head attention basically adds parallel attention layers ("heads"), which improves the performance of the network. The major takeaways are:

  • It expands the model’s ability to focus on different positions
  • It gives the attention layer multiple “representation subspaces”
Multi-Head Attention

To sum up, here is the full computation of multi-head self-attention (a small code sketch follows the figure).

Multi-head self attention
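Building on the `self_attention` helper above (head count and sizes are again arbitrary), a rough sketch of the idea: run several independent attention heads and concatenate their outputs before a final linear projection.

```python
def multi_head_attention(X, heads, W_o):
    """heads: list of (W_q, W_k, W_v) tuples, one per attention head.
    W_o: output projection of shape (num_heads * d_k, d_model)."""
    # Run each head independently -- the "representation subspaces"
    head_outputs = [self_attention(X, W_q, W_k, W_v) for W_q, W_k, W_v in heads]
    # Concatenate along the feature dimension and project back to d_model
    return np.concatenate(head_outputs, axis=-1) @ W_o

# Toy example: 2 heads of size 4, projected back to d_model = 8
heads = [tuple(rng.normal(size=(8, 4)) for _ in range(3)) for _ in range(2)]
W_o = rng.normal(size=(2 * 4, 8))
print(multi_head_attention(X, heads, W_o).shape)  # (4, 8)
```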

One problem with this model is that it doesn’t consider the position of a word in the sentence, which makes it harder to understand the sentence correctly. So, we need a positional encoding vector. This can be done by adding the positional encoding to the embedding.
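One common choice is the fixed sinusoidal encoding from the original "Attention Is All You Need" paper (the blog doesn't fix a particular scheme, so treat this as one possible sketch):

```python
def sinusoidal_positional_encoding(seq_len, d_model):
    """Fixed sinusoidal positional encodings of shape (seq_len, d_model)."""
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # even feature indices
    angle_rates = 1.0 / np.power(10000, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(positions * angle_rates)  # even dimensions: sine
    pe[:, 1::2] = np.cos(positions * angle_rates)  # odd dimensions: cosine
    return pe

# Add the positional information to the token embeddings
X_with_pos = X + sinusoidal_positional_encoding(X.shape[0], X.shape[1])
```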

So, that’s it for self-attention. Let’s move on to the Transformer.

Transformers

Explaining it diagrammatically will give a better intuition for the Transformer.

Transformer Block — In detail

As you can see, the embedding vector is added to the positional encoding vector and given as input to the attention block. The attention outputs are then added to the embedded inputs (the skip-connection concept) and layer normalisation is performed. The result is fed to the feed-forward network, again with a skip connection followed by normalisation. This is the transformer block. Architectures like BERT and GPT can be created by stacking this block further, with changes relevant to the problem statement. To learn about Transformers more thoroughly, I highly recommend Jay Alammar’s blog.
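Putting the pieces together, here is a bare-bones sketch of one such block using the helpers above. The layer norm here is simplified (no learnable gain/bias) and real implementations also add dropout, so read it as an illustration of the data flow, not a production implementation.

```python
def layer_norm(x, eps=1e-6):
    # Simplified layer normalisation (no learnable gain/bias)
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def feed_forward(x, W1, b1, W2, b2):
    # Position-wise feed-forward network with a ReLU in between
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def transformer_block(X, heads, W_o, W1, b1, W2, b2):
    # Attention sub-block: skip connection + layer norm
    attn = multi_head_attention(X, heads, W_o)
    X = layer_norm(X + attn)
    # Feed-forward sub-block: skip connection + layer norm
    ff = feed_forward(X, W1, b1, W2, b2)
    return layer_norm(X + ff)

# Toy parameters: d_model = 8, hidden size 16 (arbitrary)
W1, b1 = rng.normal(size=(8, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 8)), np.zeros(8)
out = transformer_block(X_with_pos, heads, W_o, W1, b1, W2, b2)
print(out.shape)  # (4, 8)
```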

Cool! That’s it for this blog. See you next time…

Written by Navaneeth Sharma

ML and Full Stack Developer | Love to Write
