Google BERT: A Look Inside Bidirectional Encoder Representations from Transformers

  • Tanesh Balodi
  • Oct 14, 2019
  • NLP
  • Updated on: Nov 19, 2020

As we all know, NLP is a vast field with many tasks that differ from one another, so building a single model to perform all of them is a challenging goal. In practice we handle them with a pipeline: we divide the work into sub-tasks and then group them together so that several tasks are performed at once. A major obstacle to progress in natural language processing has been the shortage of labeled data available for training.


In November 2018, Google's AI research team came up with a natural language processing model that became a huge success at predicting missing words, at sentiment analysis, and especially at question answering and the development of chatbots. Let us learn more about Google BERT.


Topics Covered

  1. Intuition behind BERT

  2. Pre-Training

  3. Working of BERT

  4. Applications of BERT

  5. Conclusion


What is the main idea behind BERT?


As the Google AI team says, this is a state-of-the-art NLP model, and according to the research paper it has shown impressive accuracy on 11 natural language processing tasks. Training with BERT is so easy that anyone can train their own question-answering system in just a few hours on a single GPU. The pre-trained model has been trained on a massive text corpus and can therefore be fine-tuned on the smaller training data used for sentiment analysis or a Q/A system.


Earlier models such as semi-supervised sequence learning, GPT, and ULMFiT were unidirectional, whereas BERT is bidirectional. So if a bidirectional model provides such groundbreaking results, why wasn't it used in earlier models? The reason is that a unidirectional model predicts the next token on the basis of the previous ones, so it can only look in one direction; BERT instead masks words and predicts them from context on both sides, which suits the various natural language processing tasks it performs.


The main advancement BERT made is the use of bidirectional training over the Transformer, where earlier models used unidirectional training. The BERT paper presented two models with different numbers of parameters:


  1. BERT Base -> The base version of BERT has the same number of parameters (about 110 million) as OpenAI’s generative pre-trained transformer (GPT).

  2. BERT Large -> This model has roughly 340 million parameters, and training this version on a normal single GPU would take months. It has better accuracy than BERT Base, which is unsurprising given how many more parameters are involved.


How does BERT work?


The main parts of Google BERT are its pre-training and its bidirectional Transformer. We will begin with the tasks involved in pre-training and then turn to the Transformer’s working. So let’s start with pre-training:


Pre-training mainly involves two tasks, masking of tokens and NSP (Next Sentence Prediction). Let us discuss both tasks.



  1. Masking of Tokens -> The input word sequence is fed in, and about 15% of the words are selected for masking; in the paper, 80% of the selected words are replaced with a [MASK] token, 10% with a random word, and 10% are left unchanged. These positions are then passed to the Transformer encoder together with a classification layer, which identifies and replaces each [MASK] with a word that fits the context of the sequence. The output is generated with the help of a softmax activation function, which gives a probability for every word in the vocabulary at each masked position.


Image: Masking of tokens in Google BERT
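As a rough illustration, the masking step above can be sketched in plain Python. The 80/10/10 split follows the BERT paper; the tiny fallback vocabulary here is invented purely for the example.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, vocab=None, seed=0):
    """Sketch of BERT-style masking: pick ~15% of positions; of those,
    80% become [MASK], 10% a random vocabulary word, 10% stay unchanged."""
    rng = random.Random(seed)
    vocab = vocab or ["the", "dog", "ran", "fast"]  # toy vocabulary
    masked, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok                       # model must predict the original
            roll = rng.random()
            if roll < 0.8:
                masked[i] = "[MASK]"
            elif roll < 0.9:
                masked[i] = rng.choice(vocab)     # random replacement
            # else: keep the original token
    return masked, labels

sentence = "the quick brown fox jumps over the lazy dog".split()
masked, labels = mask_tokens(sentence, seed=42)
print(masked)
```

The classification layer then only has to predict the original word at the positions where `labels` is set.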


  2. Next Sentence Prediction (NSP) -> The next task involves predicting whether one sentence follows another. For input we use two special tokens, [CLS] and [SEP], each with a specific role: [CLS] is placed at the head, or start, of the word sequence, whereas [SEP] is used as a token at the end of each sentence. After placing these two tokens at their specific positions, two more steps complete the next-sentence-prediction task:


  • Sentence Embedding -> A word sequence or text document contains many sentences, and in this step we tag the tokens so as to distinguish which sentence each one belongs to. For example, if there are two sentences in a text, the sentence embedding distinguishes them as Sentence A and Sentence B.


  • Positional Embedding -> Where the previous step tags sentences, the task of the positional embedding is to encode the position of each word in the sequence, i.e. a serial number for each token.


Image: Next Sentence Prediction using Google BERT


In order to make a prediction for the next sentence, the Transformer transforms this input; the output corresponding to the [CLS] token is then fed to a classification layer, and the probability that the second sentence actually follows the first is computed with the help of the softmax function.
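The token layout described above ([CLS] at the head, [SEP] after each sentence, plus sentence and position indices) can be sketched as follows; the two example sentences are made up.

```python
def build_bert_input(sentence_a, sentence_b):
    """Sketch of how two sentences are packed for next-sentence prediction:
    [CLS] at the front, [SEP] after each sentence, segment ids marking which
    sentence each token belongs to, and position ids encoding word order."""
    tokens = ["[CLS]"] + sentence_a + ["[SEP]"] + sentence_b + ["[SEP]"]
    segment_ids = [0] * (len(sentence_a) + 2) + [1] * (len(sentence_b) + 1)
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids

tokens, segments, positions = build_bert_input(
    "the man went to the store".split(),
    "he bought a gallon of milk".split(),
)
print(tokens)
```

Sentence A tokens carry segment id 0 and Sentence B tokens segment id 1, which is exactly the A/B distinction the sentence embedding provides.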


Consider this Statement -:


Kim Il-Sung [MASK1], the founder of North Korea, was born the day the Titanic [MASK2] sank.


In the above statement, two masked positions hold important information. The whole point of bidirectional masking is to predict more accurately: here context from both directions is used, rather than the conventional left-to-right or right-to-left approach of earlier models. Because of its bidirectionality it is somewhat slower, but it holds up well in terms of accuracy.


Applications of BERT    


There are many natural language processing tasks that can be done with the help of this model: all you need to do is feed a small task-specific dataset to the core model. Some of the primary applications are elaborated below:


  1. Sentiment Analysis using BERT -> In sentiment analysis we work out the context of a sentence and classify it as a positive, negative, or neutral statement. All we need to feed is the data as input to the BERT core model, which passes it through the Transformer; the Transformer’s output for the head token, also known as the [CLS] token, is then given to a classification layer that performs the sentiment analysis task.
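Schematically, the [CLS]-plus-classification-layer step can be sketched with toy numbers. The vector and weights below are invented for illustration, not taken from a trained model.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    exps = [math.exp(x - max(logits)) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify_cls(cls_vector, weights, bias):
    """Sketch of the classification head: a linear layer over the [CLS]
    output vector followed by softmax over the sentiment classes."""
    logits = [sum(w * x for w, x in zip(row, cls_vector)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)

# Toy 4-dimensional [CLS] vector and a 3-class head (negative/neutral/positive)
cls_vec = [0.2, -0.1, 0.4, 0.05]
W = [[0.1, 0.0, -0.2, 0.3], [0.0, 0.1, 0.1, 0.0], [0.3, -0.1, 0.2, 0.1]]
b = [0.0, 0.0, 0.1]
probs = classify_cls(cls_vec, W, b)
```

In a real fine-tuned model the [CLS] vector would be 768- or 1024-dimensional and the weights learned from labeled data; only the shape of the computation is shown here.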


  2. Question-Answering system -> A question-answering model can be built with the help of Google BERT that gives automated answers to the questions asked by a user, and asks a relevant question when necessary. Everyone is aware of how e-commerce websites like Amazon, Flipkart, and Snapdeal use such systems to help their users.


  3. Search Engines -> With the help of the BERT model, a named entity recognition system can be built that detects important names, places, organizations, animals, and much more, which can be used to build a great search engine and enhance its quality of results.


  4. Text Generation -> Google BERT is capable of generating text from a small piece of input information, which means it can also perform major NLP tasks like language modeling, although for language modeling in particular it is not a better choice than OpenAI’s Generative Pre-trained Transformer 2 (GPT-2), which is considered among the best language models ever built.


Above were some of the primary applications of Google BERT; overall it can perform various other NLP tasks as well. As mentioned above, its performance has been tested over 11 natural language processing tasks, which shows its efficiency.




In the case of language modeling and some other tasks there is little comparison with GPT-2, the most powerful pre-trained model of its time; the model that comes closest to it is BERT. Google BERT was released in November 2018, whereas GPT-2 was released in early 2019. Some of the differences in performance and installation are discussed below:


  1. Computation:- Running an NLP task with the help of BERT requires a few hours or less, but GPT-2 computation is far more time-consuming because of the parameters involved: its smallest version carries a whopping 117 million parameters, whereas the largest version ends up at a gigantic 1,542 million.


  2. Fine-tuning:- No fine-tuning is done on the GPT-2 model, whereas BERT has a dedicated fine-tuning stage; in many cases, fine-tuning is considered the better option for transfer learning.


  3. Architecture:- Although both use Transformers and unsupervised data, their architectures do not differ a lot from one another: BERT uses the Transformer’s encoder while GPT-2 uses its decoder, and the most noticeable factor is the number of parameters, which grows to a gigantic figure in the case of GPT-2.


Image: The architecture of the Transformer in GPT-2
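A minimal sketch of the practical difference between the two architectures: BERT’s encoder lets every token attend to every other token, while GPT-2’s decoder uses a causal mask so each token only sees positions up to itself.

```python
def attention_mask(seq_len, causal):
    """Build a seq_len x seq_len attention mask: 1 means position i may
    attend to position j. Full mask = bidirectional (encoder, as in BERT);
    causal mask = left-to-right only (decoder, as in GPT-2)."""
    if causal:
        return [[1 if j <= i else 0 for j in range(seq_len)]
                for i in range(seq_len)]
    return [[1] * seq_len for _ in range(seq_len)]

bert_mask = attention_mask(4, causal=False)   # all ones: both directions visible
gpt2_mask = attention_mask(4, causal=True)    # lower-triangular: no future tokens
```

This is why BERT can use context on both sides of a masked word, while GPT-2 is naturally suited to predicting the next token.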


Experiments over BERT


BERT was evaluated on the General Language Understanding Evaluation (GLUE) benchmark, a collection of distinct natural language tasks. The authors fine-tuned for 3 epochs with a batch size of 32 over the GLUE data at a suitable learning rate; they observed that on smaller datasets fine-tuning was sometimes unstable, so they selected the best model on the Dev sets. The results achieved by these experiments were outstanding: BERT Base achieved an average score of 79.6, whereas BERT Large achieved 82.1, the highest accuracy achieved by any NLP model at that time.
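The Dev-set selection described above can be sketched as follows. The candidate learning rates match the range recommended in the BERT paper, but the Dev scores here are made-up numbers for illustration.

```python
def pick_learning_rate(dev_scores):
    """Sketch of model selection: fine-tune with several candidate learning
    rates and keep the one whose Dev-set score is best."""
    return max(dev_scores, key=dev_scores.get)

# Hypothetical Dev accuracies for learning rates in the paper's search range
dev_scores = {5e-5: 0.83, 3e-5: 0.86, 2e-5: 0.84}
best_lr = pick_learning_rate(dev_scores)
```

In practice each entry would require a full fine-tuning run (3 epochs, batch size 32) before its Dev score is known.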




BERT was introduced as an advancement over Embeddings from Language Models (ELMo), which combined two unidirectional models; BERT was the first model to show a deeply bidirectional approach in natural language processing.


BERT achieved impressive accuracy and showed results on various NLP tasks, and is thus an outstanding model. It ruled for a few months before GPT-2 took all the glory of being the most advanced NLP technique ever built; but if we observe closely, we see that not many changes were made in GPT-2’s structure, as it follows almost the same architecture with just a much larger number of parameters. To conclude, BERT showed a revolutionary approach to natural language processing techniques, most importantly in text generation and transfer learning. For more blogs in analytics and new technologies, do read Analytics Steps.


