Level Up Your NLP Skills: Explore These 20 Open-Source Datasets

The field of Natural Language Processing (NLP) has grown rapidly in recent years, driven by increasing demand for text recognition, sentiment analysis, speech recognition, and machine-to-human communication. Industry estimates suggest that the global NLP market will reach US$ 28.6 billion by 2026, with a projected compound annual growth rate (CAGR) of 11.71% between 2018 and 2026.

Contents

1. The Blog Authorship Corpus
2. Amazon Product Dataset
3. Multi-Domain Sentiment Dataset
4. LibriSpeech
5. Free Spoken Digit Dataset (FSDD)
6. Stanford Question Answering Dataset (SQuAD)
7. Jeopardy! Questions in a JSON File
8. Yelp Reviews
9. WordNet
10. TIMIT
11. IMDB Movie Review Dataset
12. Twitter Sentiment Analysis Dataset
13. Gutenberg eBooks Corpus
14. Reuters News Corpus
15. WikiText-103
16. MovieLens Dataset
17. CoNLL 2003
18. EuroParl
19. Cornell Movie Dialogs Corpus
20. OpenSubtitles
In this article, we present 20 free and open-source NLP datasets that serve as excellent resources to kickstart your NLP project. These datasets cover various categories, including sentiment analysis, speech recognition, question answering, text classification, and more.

1. The Blog Authorship Corpus

About:

The Blog Authorship Corpus comprises a collection of posts from 19,320 bloggers on blogger.com, gathered in August 2004. This dataset includes a total of 681,288 posts, amounting to over 140 million words. On average, each blogger contributed approximately 35 posts and 7,250 words. Each blog in the dataset is stored as a separate file, with the file name indicating the blogger’s ID, gender, age, industry, and astrological sign.

Category: Sentiment Analysis

Access the Dataset: http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm
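Because the blogger metadata lives in the file name, a small parser is usually the first thing you need. The sketch below assumes the name is laid out as `<id>.<gender>.<age>.<industry>.<sign>.xml`; that exact layout, the `BloggerInfo` type, and the sample file name are illustrative assumptions, so check them against the files you actually download.

```python
from pathlib import Path
from typing import NamedTuple


class BloggerInfo(NamedTuple):
    """Metadata recovered from one corpus file name (hypothetical helper type)."""
    blogger_id: str
    gender: str
    age: int
    industry: str
    sign: str


def parse_blog_filename(filename: str) -> BloggerInfo:
    """Split an assumed '<id>.<gender>.<age>.<industry>.<sign>.xml' file name."""
    parts = Path(filename).name.split(".")
    if len(parts) != 6 or parts[-1].lower() != "xml":
        raise ValueError(f"unexpected file name layout: {filename!r}")
    blogger_id, gender, age, industry, sign, _extension = parts
    return BloggerInfo(blogger_id, gender, int(age), industry, sign)


# Illustrative file name, not guaranteed to exist in the corpus.
info = parse_blog_filename("1000331.female.37.indUnk.Leo.xml")
print(info)
```

Returning a named tuple keeps the fields self-describing, which helps when you later group posts by age band or gender.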

2. Amazon Product Dataset

About:

The Amazon Product dataset provides product reviews and metadata from Amazon, spanning from May 1996 to July 2014. With a massive collection of 142.8 million reviews, this dataset offers insights into ratings, text, helpfulness votes, product descriptions, category information, price, brand, image features, and links (also viewed/also bought graphs).

Category: Sentiment Analysis

Access the Dataset: http://jmcauley.ucsd.edu/data/amazon/

3. Multi-Domain Sentiment Dataset

About:

The Multi-Domain Sentiment Dataset contains product reviews extracted from Amazon.com across four product types (domains): kitchen, books, DVDs, and electronics. While the exact number of reviews varies by domain, each domain provides several thousand reviews. The dataset includes star ratings (ranging from 1 to 5 stars) that can be converted into binary labels.

Category: Sentiment Analysis

Access the Dataset: http://www.cs.jhu.edu/~mdredze/datasets/sentiment/index2.html
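The conversion from star ratings to binary labels mentioned above can be sketched in a few lines. The threshold used here (more than 3 stars is positive, fewer than 3 is negative, exactly 3 is dropped as ambiguous) is a common convention and an assumption of this example; the function name `star_to_label` is hypothetical.

```python
from typing import Optional


def star_to_label(stars: float) -> Optional[int]:
    """Map a 1-5 star rating to 1 (positive), 0 (negative), or None (ambiguous)."""
    if stars > 3:
        return 1
    if stars < 3:
        return 0
    return None  # 3-star reviews are treated as ambiguous in this sketch


ratings = [5, 1, 3, 4, 2]
labels = [star_to_label(r) for r in ratings]

# Keep only the reviews that received a clear label.
labeled = [(r, l) for r, l in zip(ratings, labels) if l is not None]
print(labeled)
```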

4. LibriSpeech

About:

LibriSpeech consists of approximately 1,000 hours of read English speech sampled at 16 kHz. The corpus, prepared by Vassil Panayotov with the assistance of Daniel Povey, is derived from audiobooks read for the LibriVox project. The audio has been carefully segmented and aligned with the corresponding text.

Category: Speech Recognition

Access the Dataset: http://www.openslr.org/12/

5. Free Spoken Digit Dataset (FSDD)

About:

The Free Spoken Digit Dataset (FSDD) is an open dataset of simple audio recordings of spoken digits, stored as WAV files at 8 kHz. The recordings have been trimmed so that they contain minimal silence at the beginning and end.

Category: Speech Recognition

Access the Dataset: https://github.com/Jakobovski/free-spoken-digit-dataset

6. Stanford Question Answering Dataset (SQuAD)

About:

The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles. Each question is paired with a reading passage, and the answer is a segment of text (a span) from that passage. SQuAD 2.0 combines the 100,000 questions from SQuAD 1.1 with over 50,000 new unanswerable questions written to resemble answerable ones.

Category: Question Answering

Access the Dataset: https://rajpurkar.github.io/SQuAD-explorer/
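To make the span idea concrete, here is a minimal sketch that walks a SQuAD-style JSON structure and recovers each answer by slicing the passage at its character offset. The nested `data` / `paragraphs` / `qas` / `answers` layout follows the published SQuAD files; the toy passage, the questions, and the helper name `iter_examples` are made up for illustration.

```python
# A tiny, hand-written record in SQuAD 2.0 style (not real dataset content).
squad_like = {
    "version": "v2.0",
    "data": [
        {
            "title": "Example_Article",
            "paragraphs": [
                {
                    "context": "Volunteers record public domain books as audio.",
                    "qas": [
                        {
                            "id": "q1",
                            "question": "Who records the books?",
                            "is_impossible": False,
                            "answers": [{"text": "Volunteers", "answer_start": 0}],
                        },
                        {
                            "id": "q2",
                            "question": "When was the first book recorded?",
                            "is_impossible": True,
                            "answers": [],
                        },
                    ],
                }
            ],
        }
    ],
}


def iter_examples(dataset):
    """Yield (question, answer) pairs; answer is None for unanswerable questions."""
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                if qa.get("is_impossible", False) or not qa["answers"]:
                    yield qa["question"], None
                    continue
                answer = qa["answers"][0]
                start = answer["answer_start"]
                span = context[start:start + len(answer["text"])]
                # The stored text should match the slice taken from the passage.
                if span != answer["text"]:
                    raise ValueError(f"misaligned answer for question {qa['id']}")
                yield qa["question"], span


examples = list(iter_examples(squad_like))
print(examples)
```

Checking that the slice matches the stored answer text is a cheap way to catch offset errors introduced by preprocessing, such as whitespace normalization.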

7. Jeopardy! Questions in a JSON File

About:

This dataset is a single JSON file of approximately 53 MB that contains 216,930 Jeopardy! questions, their answers, and additional data. For comparison, the total number of Jeopardy! questions over the course of the show's history has reached 252,583.

Category: Question Answering

Access the Dataset: https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/
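Since the data ships as one JSON file, loading it takes only the standard library. The sketch below assumes each record has `category`, `air_date`, `question`, `value`, `answer`, `round`, and `show_number` fields, with `value` stored as a dollar string or null; those field names and the two sample records are assumptions for illustration, so inspect the real file before relying on them.

```python
import json
from typing import Optional

# Two made-up records in the assumed layout (not real clues).
raw = """
[
  {"category": "SAMPLE CATEGORY", "air_date": "2004-12-31",
   "question": "A made-up clue", "value": "$1,000",
   "answer": "a made-up answer", "round": "Jeopardy!", "show_number": "1"},
  {"category": "SAMPLE FINAL", "air_date": "2004-12-31",
   "question": "Another made-up clue", "value": null,
   "answer": "another made-up answer", "round": "Final Jeopardy!", "show_number": "1"}
]
"""


def parse_value(value: Optional[str]) -> Optional[int]:
    """Turn a string such as '$1,000' into 1000; pass None through unchanged."""
    if value is None:
        return None
    return int(value.replace("$", "").replace(",", ""))


clues = json.loads(raw)
values = [parse_value(clue["value"]) for clue in clues]
print(values)
```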

8. Yelp Reviews

About:

The Yelp dataset is a subset of Yelp's businesses, reviews, and user data, made available for personal, educational, and academic purposes. It comprises 6,685,900 reviews, 200,000 pictures, and 192,609 businesses across ten metropolitan areas.

Category: Text Classification

Access the Dataset: https://www.yelp.com/dataset

9. WordNet

About:

WordNet is an extensive lexical database of English that organizes words based on their meanings. Like a thesaurus, WordNet groups words by synonymy, such as “shut” and “close” or “car” and “automobile.” Each group of synonyms is called a synset, and the database contains about 117,000 of them, each linked to other synsets through a small number of conceptual relations.

Category: Text Classification

Access the Dataset: https://wordnet.princeton.edu/

10. TIMIT

About:

The TIMIT Acoustic-Phonetic Continuous Speech Corpus is specifically designed to provide speech data for acoustic-phonetic studies and the development and evaluation of automatic speech recognition systems. This dataset includes broadband recordings of 630 speakers representing eight major dialects of American English. Each speaker reads ten phonetically rich sentences.

Category: Speech Recognition

Access the Dataset: https://catalog.ldc.upenn.edu/LDC93S1

11. IMDB Movie Review Dataset

About:

The IMDB Movie Review Dataset is a collection of movie reviews taken from the IMDB website; the version linked below contains 50,000 reviews. Each review is labeled as positive or negative, which makes the dataset a convenient resource for training binary sentiment analysis models.

Category: Sentiment Analysis

Access the Dataset: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

12. Twitter Sentiment Analysis Dataset

About:

The Twitter Sentiment Analysis Dataset is a collection of tweets extracted from the Twitter platform. The version linked below, Sentiment140, contains 1.6 million tweets labeled with a numeric polarity code: 0 for negative and 4 for positive, with 2 reserved for neutral. This dataset is well suited to training sentiment analysis models focused on social media text.

Category: Sentiment Analysis

Access the Dataset: https://www.kaggle.com/datasets/kazanova/sentiment140
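A first preprocessing step for this kind of data is turning the raw label column into readable sentiment names. The sketch below assumes the Sentiment140 layout: a headerless CSV with six columns (polarity, tweet id, date, query flag, user, text) and polarity codes 0, 2, and 4. The column names in `COLUMNS` and the two sample rows are invented for illustration; the sample is read from an in-memory string rather than the downloaded file.

```python
import csv
import io

# Assumed column order for the headerless CSV (names are this sketch's choice).
COLUMNS = ["target", "id", "date", "flag", "user", "text"]
# Assumed polarity codes: 0 = negative, 2 = neutral, 4 = positive.
POLARITY = {"0": "negative", "2": "neutral", "4": "positive"}

# Two made-up rows in the assumed layout.
sample_csv = (
    '"0","1","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","user_a","missed the bus again"\n'
    '"4","2","Mon Apr 06 22:20:00 PDT 2009","NO_QUERY","user_b","great coffee this morning"\n'
)


def read_tweets(handle):
    """Yield one dict per row, adding a human-readable 'sentiment' field."""
    for row in csv.reader(handle):
        record = dict(zip(COLUMNS, row))
        record["sentiment"] = POLARITY[record["target"]]
        yield record


tweets = list(read_tweets(io.StringIO(sample_csv)))
for tweet in tweets:
    print(tweet["sentiment"], "|", tweet["text"])
```

When reading the real file, you would pass an opened file handle instead of `io.StringIO`, and you may need to specify a non-UTF-8 encoding when opening it.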

13. Gutenberg eBooks Corpus

About:

The Gutenberg eBooks Corpus is a vast collection of freely available eBooks from Project Gutenberg. It spans a wide range of literary works, including novels, plays, poems, and non-fiction texts, and is useful for NLP tasks such as text classification and language modeling. Note that the Hugging Face link given for this entry points to a poetry-focused corpus drawn from Project Gutenberg rather than the full collection.

Category: Text Classification

Access the Dataset: https://huggingface.co/datasets/biglam/gutenberg-poetry-corpus

14. Reuters News Corpus

About:

The Reuters News Corpus is a comprehensive dataset consisting of news articles from the Reuters news agency. This dataset covers a wide range of topics, including finance, politics, sports, and technology. With a large collection of articles, it is suitable for various NLP tasks such as text classification, topic modeling, and sentiment analysis.

Category: Text Classification

Access the Dataset: https://paperswithcode.com/dataset/reuters-21578

15. WikiText-103

About:

The WikiText-103 dataset is a large-scale language modeling dataset derived from Wikipedia articles. It comprises over 100 million tokens and is widely used for training and evaluating language models. This dataset provides a rich and diverse range of text, making it valuable for language modeling and text generation tasks.

Category: Language Modeling

Access the Dataset: https://developer.ibm.com/exchanges/data/all/wikitext-103/

16. MovieLens Dataset

About:

The MovieLens dataset is a popular dataset for recommender systems and movie-related analysis. It includes movie ratings, user information, and movie metadata. With millions of ratings and diverse movie genres, this dataset is suitable for building recommendation engines and conducting movie-related research.

Category: Recommender Systems

Access the Dataset: https://grouplens.org/datasets/movielens/

17. CoNLL 2003

About:

The CoNLL 2003 dataset is widely used for named entity recognition (NER). It consists of news articles from the Reuters Corpus annotated with four entity types: persons, organizations, locations, and miscellaneous names. It is a standard benchmark for developing and evaluating NER models.

Category: Named Entity Recognition

Access the Dataset: https://huggingface.co/datasets/conll2003
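NER annotations of this kind are typically stored one token per line with a tag column, and sentences are separated by blank lines. The sketch below assumes the four-column layout (token, part-of-speech tag, chunk tag, entity tag) and `B-`/`I-`/`O` style entity tags; the helper names `read_conll` and `extract_entities` are hypothetical, and the sample text is a short hand-written fragment in that style.

```python
sample = """\
-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
. . O O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
"""


def read_conll(text):
    """Return a list of sentences, each a list of (token, entity_tag) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        # Blank lines and document markers both end the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        token, _pos, _chunk, ner = line.split()
        current.append((token, ner))
    if current:
        sentences.append(current)
    return sentences


def extract_entities(sentence):
    """Group consecutive tagged tokens into (entity_text, entity_type) pairs."""
    entities, words, label = [], [], None
    for token, tag in sentence:
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != label
        )
        if starts_new or tag == "O":
            if words:
                entities.append((" ".join(words), label))
            words, label = ([token], tag[2:]) if starts_new else ([], None)
        else:
            words.append(token)  # an I- tag continuing the current entity
    if words:
        entities.append((" ".join(words), label))
    return entities


sentences = read_conll(sample)
entities = [extract_entities(s) for s in sentences]
print(entities)
```

Treating an `I-` tag whose type differs from the current entity as the start of a new entity lets the same function cope with files where `B-` is only used between adjacent entities of the same type.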

18. EuroParl

About:

The EuroParl dataset is a collection of parallel texts from the proceedings of the European Parliament. It contains translated versions of parliamentary speeches in multiple languages. This dataset is valuable for tasks such as machine translation, cross-lingual information retrieval, and multilingual NLP.

Category: Machine Translation

Access the Dataset: https://www.statmt.org/europarl/

19. Cornell Movie Dialogs Corpus

About:

The Cornell Movie Dialogs Corpus is a dataset of movie conversations containing a wide range of dialogues between characters. It includes metadata such as movie titles, genres, and dialogue lines. This dataset is often used for tasks like dialogue generation, sentiment analysis, and language understanding.

Category: Dialogue Systems

Access the Dataset: https://huggingface.co/datasets/cornell_movie_dialog
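If you work with the raw text files of this corpus rather than a preprocessed copy, each dialogue line arrives as one delimited record. The sketch below assumes the fields are separated by the string ` +++$+++ ` and appear in the order line ID, character ID, movie ID, character name, utterance; the field names in `FIELDS` and the two sample lines are invented for illustration.

```python
SEPARATOR = " +++$+++ "
# Assumed field order; the names are this sketch's own choice.
FIELDS = ["line_id", "character_id", "movie_id", "character_name", "text"]

# Two made-up lines in the assumed layout.
sample_lines = [
    "L1 +++$+++ u0 +++$+++ m0 +++$+++ ALEX +++$+++ Did you see the film?",
    "L2 +++$+++ u1 +++$+++ m0 +++$+++ SAM +++$+++ Not yet.",
]


def parse_line(raw: str) -> dict:
    """Split one record into named fields, keeping the utterance text intact."""
    # maxsplit protects the utterance in case it contains the separator itself.
    parts = raw.rstrip("\n").split(SEPARATOR, maxsplit=len(FIELDS) - 1)
    if len(parts) != len(FIELDS):
        raise ValueError(f"unexpected number of fields in line: {raw!r}")
    return dict(zip(FIELDS, parts))


parsed = [parse_line(line) for line in sample_lines]

# Index utterances by line ID so conversations can be reassembled later.
text_by_id = {record["line_id"]: record["text"] for record in parsed}
print(text_by_id)
```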

20. OpenSubtitles

About:

The OpenSubtitles dataset is a large collection of movie and TV show subtitles in various languages. It provides a wealth of text data suitable for tasks like machine translation, dialogue systems, and language modeling. With subtitles from diverse genres and languages, it offers a valuable resource for multilingual NLP.

Category: Multilingual NLP

Access the Dataset: https://autonlp.ai/datasets/opensubtitles

With these 20 free and open-source NLP datasets, you can give your NLP project a foundation of high-quality data. They cover a wide range of categories and tasks, including sentiment analysis, speech recognition, question answering, text classification, recommender systems, and more. Remember to attribute each dataset properly and to comply with its license and terms of use.