
Doc2Vec Made Easy: A Step-by-Step Guide to Gensim Implementation

If you’re a natural language processing (NLP) enthusiast or just starting out in the field, you may have come across the Doc2Vec model. Doc2Vec is a popular NLP model used for document similarity and classification tasks. In this article, we will walk through implementing a Doc2Vec model with Gensim, a popular Python library for topic modeling, document indexing, and similarity retrieval with large corpora.

Introduction to Doc2Vec

Doc2Vec is an extension of the popular Word2Vec model introduced by Tomas Mikolov and colleagues in 2013; Doc2Vec itself (the Paragraph Vector model) was proposed by Le and Mikolov in 2014. The model is used for document embedding: it represents each document as a point in a vector space, allowing us to measure the similarity between documents. Unlike the bag-of-words model, Doc2Vec captures semantic relationships between words, which tends to make it more accurate in document classification and similarity tasks.

Installing Gensim

Before we can start implementing a Doc2Vec model, we need to install Gensim. Gensim can be easily installed using pip. Open a terminal window and run the following command:

pip install gensim
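
Gensim’s API changed between the 3.x and 4.x releases (for example, model.docvecs was renamed model.dv), so it is worth confirming which version you have installed. The code in this article assumes Gensim 4.x:

import gensim

# Print the installed Gensim version
print(gensim.__version__)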

Preparing the Data

To train a Doc2Vec model, we need a corpus of documents. In this example, we will use a sample of the 20 Newsgroups dataset stored as JSON on GitHub. We will download the file and save it to a local directory:

import urllib.request

# Downloading the corpus
url = "https://raw.githubusercontent.com/selva86/datasets/master/newsgroups.json"
urllib.request.urlretrieve(url, "./newsgroups.json")

After downloading the corpus, we can load it using the following code:

import json

# Loading the corpus
with open("./newsgroups.json", "r") as f:
    data = json.load(f)
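
The rest of this article assumes the parsed JSON is a dictionary whose "data" key holds a list of records, each with "content" and "target" fields. Since the file layout on GitHub may differ, a quick inspection is worthwhile before going further:

# Inspect the parsed structure before indexing into it
print(type(data))
if isinstance(data, dict):
    print(list(data.keys()))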

We then preprocess the data by lowercasing and tokenizing each document, then removing stop words and non-alphanumeric tokens:

import nltk
from gensim.parsing.preprocessing import STOPWORDS

# Preprocessing the data
nltk.download("punkt")

docs = []
for d in data["data"]:
    # Tokenizing the lowercased document
    tokens = nltk.word_tokenize(d["content"].lower())

    # Keeping alphanumeric tokens that are not stop words
    tokens = [t for t in tokens if t.isalnum() and t not in STOPWORDS]

    docs.append(tokens)
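
A quick look at the first few tokens of the first document confirms the preprocessing worked as expected:

# Peek at the first ten tokens of the first preprocessed document
print(docs[0][:10])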

Training the Doc2Vec Model

Now that we have preprocessed the data, we can train the Doc2Vec model. To train the model, we first need to convert the documents into TaggedDocument objects. A TaggedDocument is a simple data structure that pairs a document's list of tokens with one or more tags; here we use each document's index in the corpus as its tag:

from gensim.models.doc2vec import TaggedDocument

# Converting the documents into TaggedDocument objects
tagged_docs = []
for i, d in enumerate(docs):
    tagged_docs.append(TaggedDocument(d, [i]))
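
Printing the first entry shows the structure Doc2Vec expects: a list of tokens paired with a list of tags.

# Inspect the first TaggedDocument
print(tagged_docs[0])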

We can then train the Doc2Vec model:

from gensim.models import Doc2Vec

# Training the Doc2Vec model
model = Doc2Vec(
    vector_size=50,  # Size of the document vectors
    min_count=2,  # Ignore words with a frequency less than 2
    epochs=40  # Number of iterations over the corpus
)

model.build_vocab(tagged_docs)  # Build the vocabulary
model.train(tagged_docs, total_examples=model.corpus_count, epochs=model.epochs)  # Train the model
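
Training can take a while on larger corpora, so it is often worth saving the model to disk and reloading it later; the filename here is arbitrary:

# Persist the trained model and reload it
model.save("doc2vec_newsgroups.model")
model = Doc2Vec.load("doc2vec_newsgroups.model")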

Using the Doc2Vec Model

Now that we have trained the Doc2Vec model, we can use it for document similarity and classification tasks. To find the document most similar to a query document, we infer a vector for the query with infer_vector() and then search the trained document vectors with most_similar() (exposed as model.dv in Gensim 4.x; older releases called it model.docvecs):

# Finding the most similar document
query_doc = docs[0]
vector = model.infer_vector(query_doc)
similar_docs = model.dv.most_similar([vector], topn=1)

# Each result is a (tag, cosine similarity) pair; the tag is the
# document's index in the corpus
print(data["data"][similar_docs[0][0]]["content"])

The above code prints the content of the document most similar to the first document in the corpus. Because the query is itself a training document, the top hit is usually that same document, which is expected rather than a bug.
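
A common sanity check, along the lines of the one in the Gensim documentation, is to re-infer a vector for every training document and count how often each document ranks as its own nearest neighbor:

# Sanity check: most training documents should be their own
# nearest neighbor when their vectors are re-inferred
self_hits = 0
for i, d in enumerate(tagged_docs):
    inferred = model.infer_vector(d.words)
    top_tag, _ = model.dv.most_similar([inferred], topn=1)[0]
    if top_tag == i:
        self_hits += 1

print(f"{self_hits}/{len(tagged_docs)} documents are their own nearest neighbor")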

To classify a document, we can use the infer_vector() method to convert the document into a vector and then pass it to a classification algorithm:

# Classifying a document
from sklearn.linear_model import LogisticRegression

# Creating the training data: one inferred vector and one label
# per document
train_data = []
train_labels = []
for i, d in enumerate(tagged_docs):
    train_data.append(model.infer_vector(d.words))
    train_labels.append(data["data"][i]["target"])

# Training a logistic regression classifier (max_iter raised to
# avoid convergence warnings with the default solver)
clf = LogisticRegression(max_iter=1000)
clf.fit(train_data, train_labels)

# Classifying a new document with the same preprocessing as training
new_doc = "This is a test document"
tokens = nltk.word_tokenize(new_doc.lower())
tokens = [t for t in tokens if t.isalnum() and t not in STOPWORDS]
new_doc_vector = model.infer_vector(tokens)

print(clf.predict([new_doc_vector]))

The above code will classify the new document into one of the target classes.
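
Note that the classifier above is trained and used on the same documents, which says nothing about how well it generalizes. A minimal evaluation sketch, holding out 20% of the corpus with scikit-learn's train_test_split (the split fraction and random seed are arbitrary choices):

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hold out 20% of the documents for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    train_data, train_labels, test_size=0.2, random_state=42
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Report accuracy on the held-out documents
print(accuracy_score(y_test, clf.predict(X_test)))

Strictly speaking, the Doc2Vec model itself has already seen the held-out documents during training, so for a rigorous evaluation the embedding model should also be trained only on the training split.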

Conclusion

In this article, we discussed how to implement a Doc2Vec model using Gensim. We covered the basics of Doc2Vec, installing Gensim, preparing the data, training the model, and using it for document similarity and classification tasks. Doc2Vec is a powerful NLP model that can be applied to a wide range of tasks, and Gensim makes it straightforward to implement.