Mastering Word Prediction: Unveiling the Magic of Masked Language Modelling

In natural language processing (NLP), masked language modelling has emerged as a widely used technique. By intentionally hiding words in a sentence and training a model to predict them, it enables word prediction. The technique plays a role in applications such as language translation, sentiment analysis, and name generation. In this article, we will delve into the intricacies of masked language modelling, with a focus on its implementation using BERT, a model that combines deep learning and NLP.

What is Masked Language Modelling?

Masked language modelling, like masked image modelling, resembles autoencoding: the model reconstructs the original content from a corrupted input. As the name suggests, masking plays a crucial role in the procedure. Some words in an input sequence or sentence are masked, and the model is tasked with predicting the masked words to complete the sentence. This process is akin to filling in the blanks on an exam paper. To illustrate how masked language modelling works, consider the following example:

Question: What is ______ name?

Answer: What is my/your/his/her/its name?

During training, the model learns the statistical properties of word sequences. Because it is asked to predict only one or a few words rather than an entire sentence or paragraph, it learns to predict each masked word from the other words present in the sentence.
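The idea of predicting a word from its neighbours can be sketched without any neural network. The toy example below simply counts which words appear between a given left and right neighbour in a tiny made-up corpus. The corpus and the function name fill_blank are assumptions for illustration only; a real masked language model uses the whole sentence and learned representations rather than raw counts.

```python
from collections import Counter

# Tiny made-up corpus, purely for illustration.
corpus = [
    "what is your name",
    "what is your age",
    "what is her name",
    "and what is your name",
]

def fill_blank(left, right, sentences):
    """Count which words appear between `left` and `right` in the corpus."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        # Slide a three-word window and match the neighbours on both sides.
        for prev, word, nxt in zip(words, words[1:], words[2:]):
            if prev == left and nxt == right:
                counts[word] += 1
    return counts.most_common()

# "what is ____ name": candidates ranked by how often they fill the gap.
candidates = fill_blank("is", "name", corpus)
print(candidates)
```

Here "your" ranks above "her" only because it fills the gap more often in this corpus, which is the kind of statistical regularity a masked language model picks up at a much larger scale.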

Now that we have a basic understanding of masked language models, let’s explore the areas where they find application.

Applications of Masked Language Models

Masked language models are particularly useful in scenarios where understanding words in context is crucial. Because a word can have different meanings depending on its context, these models need to learn deep and diverse representations of words. They have demonstrated improved performance in various downstream tasks, especially syntactic tasks that rely on the lower-layer representations of a model rather than its higher-layer representations. Additionally, masked language models learn deep bidirectional representations of words, capturing the context both before and after each word in a sentence.

Having discussed the significance of masked language models, let’s now examine their implementation.

In this article, we will use the BERT base uncased model (bert-base-uncased) for masked language modelling. This model was pre-trained on English text from BookCorpus, a dataset of 11,038 books, and from English Wikipedia (excluding lists, tables, and headers), using a masked language modelling objective.

For masked language modelling, the BERT model takes a sentence as input, and during pre-training 15% of the words in the sentence are masked at random. The masked sentence is run through the model, which predicts the masked words from the context surrounding them. One advantage of this approach is that the model learns bidirectional representations of sentences, resulting in more precise predictions.
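The masking step can be sketched in a few lines of plain Python. This is a simplified illustration rather than BERT's actual preprocessing: it splits on whitespace instead of using WordPiece tokens, and it turns every selected position into [MASK], whereas BERT's pre-training also replaces some selected tokens with random ones or leaves them unchanged. The function name mask_tokens and its parameters are hypothetical.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Replace roughly `mask_rate` of the tokens with [MASK].

    Returns the masked token list and a dict mapping each masked
    position to the original token (the prediction target).
    """
    rng = random.Random(seed)
    # Always mask at least one token so short sentences still yield a target.
    n_to_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = masked[pos]
        masked[pos] = "[MASK]"
    return masked, labels

sentence = "the quick brown fox jumps over the lazy dog near the river bank today"
tokens = sentence.split()  # whitespace split is an assumption, not WordPiece
masked, labels = mask_tokens(tokens)
print(masked)
print(labels)
```

During training, the model sees the masked list as input and is scored on how well it recovers the words stored in labels.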

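The model can also work with two masked sentences at once. It concatenates the two sentences into a single input and, in addition to predicting the masked words, was pre-trained to predict whether the second sentence followed the first in the original text (next sentence prediction). This helps the model capture the relationship between sentences when the two are correlated.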
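As a rough sketch of how two sentences are combined into one input, the example below lays out the tokens with BERT's special [CLS] and [SEP] markers and assigns a segment id to each token. The helper name build_pair_input is hypothetical, and the token lists are written by hand rather than produced by the real tokenizer.

```python
def build_pair_input(tokens_a, tokens_b):
    """Join two token lists in BERT's sentence-pair layout."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment ids: 0 for [CLS], sentence A and its [SEP];
    # 1 for sentence B and its [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

# Hand-written tokens for two related sentences, each with one masked word.
tokens, segment_ids = build_pair_input(
    ["what", "is", "[MASK]", "name", "?"],
    ["my", "name", "is", "[MASK]", "."],
)
for token, segment in zip(tokens, segment_ids):
    print(segment, token)
```

In practice the tokenizer from the transformers library builds this layout itself when it is given two sentences, so a helper like this is only useful for seeing what the combined input looks like.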

To obtain this model, we can use the transformers library, which can be installed with the following command (the leading ! is needed only in notebook environments):

!pip install transformers

After installing the library, we are ready to use the pre-trained models available through the pipeline function of the transformers library. Let’s import it:

from transformers import pipeline

Next, we’ll instantiate the model:

model = pipeline('fill-mask', model='bert-base-uncased')

With the model instantiated, we can now predict masked words. To do so, we need to replace the word we want to predict with [MASK] in the sentence. For example:

pred = model("What is [MASK] name?")
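The fill-mask pipeline returns a list of dictionaries, one per candidate word, with keys that include score, token_str, and sequence. A small helper, sketched below, keeps only the top candidates. The name summarize is hypothetical, and the sample list uses placeholder values rather than real model output, so it can be run without downloading the model.

```python
def summarize(predictions, k=3):
    """Return the k highest-scoring (token, score) pairs."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [(p["token_str"], round(p["score"], 3)) for p in ranked[:k]]

# Placeholder values standing in for a real `pred`; not actual model output.
example_pred = [
    {"score": 0.10, "token_str": "word_b", "sequence": "what is word_b name?"},
    {"score": 0.60, "token_str": "word_a", "sequence": "what is word_a name?"},
    {"score": 0.05, "token_str": "word_c", "sequence": "what is word_c name?"},
]
print(summarize(example_pred, k=2))
```

With the real pipeline, the same helper could be called as summarize(pred).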

The output of the prediction is a list of candidate words for the masked position, each with an associated score. We can also use this model to extract features from any text using the PyTorch or TensorFlow libraries.

Using PyTorch:

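from transformers import BertTokenizer, BertModel

# Load the tokenizer and the bare BERT encoder (no masked-LM head).
# Named bert_model so the fill-mask pipeline stored in `model` is not overwritten.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')
text = "What is your name?"

# Tokenize to PyTorch tensors and run the encoder
encoded_input = tokenizer(text, return_tensors='pt')
output = bert_model(**encoded_input)
# output.last_hidden_state holds one feature vector per token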

Using TensorFlow:

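from transformers import BertTokenizer, TFBertModel

# Load the tokenizer and the TensorFlow version of the BERT encoder.
# Named tf_bert_model so the fill-mask pipeline stored in `model` is not overwritten.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tf_bert_model = TFBertModel.from_pretrained('bert-base-uncased')
text = "What is your name?"

# Tokenize to TensorFlow tensors and run the encoder
encoded_input = tokenizer(text, return_tensors='tf')
output = tf_bert_model(encoded_input)
# output.last_hidden_state holds one feature vector per token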

One limitation of the model is its tendency to produce biased predictions, even after being trained on relatively neutral data. Consider the following examples:

# `model` here is the fill-mask pipeline created earlier
pred = model("He can work as a [MASK].")
pred = model("She can work as a [MASK].")

In both cases, the model’s predictions can exhibit gender bias, which is worth keeping in mind whenever the model is used. Even so, these examples show how straightforward it is to build on and use a masked language model based on the BERT transformer.
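One simple way to inspect such differences is to compare the candidate words returned for the two prompts. The sketch below does this with set operations on the token_str field of each prediction. The helper name unique_tokens is hypothetical, and the lists contain placeholder tokens rather than real model output; with the real pipeline, each call's result would need to be stored in its own variable first.

```python
def unique_tokens(preds_a, preds_b):
    """Return tokens predicted for one prompt but not the other."""
    tokens_a = {p["token_str"] for p in preds_a}
    tokens_b = {p["token_str"] for p in preds_b}
    return sorted(tokens_a - tokens_b), sorted(tokens_b - tokens_a)

# Placeholder predictions; real lists would come from the two pipeline calls.
preds_he = [{"token_str": "job_1"}, {"token_str": "job_2"}, {"token_str": "job_3"}]
preds_she = [{"token_str": "job_2"}, {"token_str": "job_3"}, {"token_str": "job_4"}]

only_he, only_she = unique_tokens(preds_he, preds_she)
print("Only for 'He':", only_he)
print("Only for 'She':", only_she)
```

Words that appear for only one pronoun are a starting point for spotting bias, although a proper evaluation would look at scores and many more prompts.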

Final Thoughts

In this article, we have explored masked language modelling and shed light on its use cases. We have also walked through the use of the BERT base uncased model for masked language modelling. Masked language models open up a realm of possibilities in the field of NLP, and their potential is still being unlocked. By combining deep learning and natural language processing, masked language modelling paves the way for advancements in various domains.