The field of Natural Language Processing (NLP) has grown rapidly in recent years, driven by increasing demand for text recognition, sentiment analysis, speech recognition, and machine-to-human communication. Industry estimates suggest that the global NLP market will reach US$ 28.6 billion by 2026, with a projected compound annual growth rate (CAGR) of 11.71% between 2018 and 2026.
In this article, we present 20 free and open-source NLP datasets that serve as excellent resources to kickstart your NLP project. These datasets cover various categories, including sentiment analysis, speech recognition, question answering, text classification, and more.
1. The Blog Authorship Corpus
About:
The Blog Authorship Corpus comprises a collection of posts from 19,320 bloggers on blogger.com, gathered in August 2004. This dataset includes a total of 681,288 posts, amounting to over 140 million words. On average, each blogger contributed approximately 35 posts and 7,250 words. Each blog in the dataset is stored as a separate file, with the file name indicating the blogger’s ID, gender, age, industry, and astrological sign.
Category: Text Classification (author profiling)
Access the Dataset: http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm
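As a minimal sketch of how the metadata encoded in each file name could be recovered, the helper below splits a name into the five fields mentioned above. The dot-separated layout and the example name `1000331.female.37.indUnk.Leo.xml` are assumptions about the downloaded archive, so check them against the files you actually get.

```python
from typing import NamedTuple


class BloggerInfo(NamedTuple):
    blogger_id: str
    gender: str
    age: int
    industry: str
    sign: str


def parse_blog_filename(name: str) -> BloggerInfo:
    """Split an assumed '<id>.<gender>.<age>.<industry>.<sign>.xml' file name."""
    if name.endswith(".xml"):
        name = name[: -len(".xml")]
    parts = name.split(".")
    if len(parts) < 5:
        raise ValueError(f"unexpected file name: {name!r}")
    blogger_id, gender, age = parts[:3]
    sign = parts[-1]
    # Anything between the age and the sign is treated as the industry field.
    industry = ".".join(parts[3:-1])
    return BloggerInfo(blogger_id, gender, int(age), industry, sign)


# Hypothetical file name used only for illustration.
info = parse_blog_filename("1000331.female.37.indUnk.Leo.xml")
print(info.gender, info.age, info.sign)
```

Returning a named tuple keeps the fields readable when you later group posts by age or gender.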
2. Amazon Product Dataset
About:
The Amazon Product dataset provides product reviews and metadata from Amazon, spanning from May 1996 to July 2014. With a massive collection of 142.8 million reviews, this dataset offers insights into ratings, text, helpfulness votes, product descriptions, category information, price, brand, image features, and links (also viewed/also bought graphs).
Category: Sentiment Analysis
Access the Dataset: http://jmcauley.ucsd.edu/data/amazon/
3. Multi-Domain Sentiment Dataset
About:
The Multi-Domain Sentiment Dataset contains product reviews extracted from Amazon.com across four product types (domains): kitchen, books, DVDs, and electronics. While the exact number of reviews varies by domain, each domain provides several thousand reviews. The dataset includes star ratings (ranging from 1 to 5 stars) that can be converted into binary labels.
Category: Sentiment Analysis
Access the Dataset: http://www.cs.jhu.edu/~mdredze/datasets/sentiment/index2.html
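The conversion from star ratings to binary labels can be sketched as follows. The 3-star cut-off, with exactly neutral reviews dropped, is an assumption based on a common convention; adjust it to whatever your experiment needs.

```python
from typing import Optional


def star_to_binary(stars: float, neutral: float = 3.0) -> Optional[int]:
    """Map a 1-5 star rating to 1 (positive), 0 (negative), or None (ambiguous)."""
    if not 1 <= stars <= 5:
        raise ValueError(f"rating out of range: {stars}")
    if stars > neutral:
        return 1
    if stars < neutral:
        return 0
    return None  # exactly neutral: dropped in this sketch


ratings = [5, 4, 3, 2, 1]
labels = [star_to_binary(r) for r in ratings]
print(labels)
```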
4. LibriSpeech
About:
LibriSpeech consists of approximately 1,000 hours of read English speech sampled at 16 kHz. This corpus, prepared by Vassil Panayotov with the assistance of Daniel Povey, is derived from read audiobooks from the LibriVox project. The recordings have been carefully segmented and aligned.
Category: Speech Recognition
Access the Dataset: http://www.openslr.org/12/
5. Free Spoken Digit Dataset (FSDD)
About:
The Free Spoken Digit Dataset (FSDD) is an open dataset of simple speech recordings of spoken digits, stored as WAV files at 8 kHz. The recordings have been trimmed so that they have minimal silence at the beginning and end.
Category: Speech Recognition
Access the Dataset: https://github.com/Jakobovski/free-spoken-digit-dataset
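A quick sanity check on a recording could look like the sketch below, which uses only the standard-library `wave` module. The `<digit>_<speaker>_<index>.wav` naming pattern and the example name are assumptions about the repository layout, and the demo writes its own half second of silence rather than reading a real FSDD file.

```python
import io
import wave
from typing import Tuple


def parse_fsdd_name(name: str) -> Tuple[int, str, int]:
    """Split an assumed '<digit>_<speaker>_<index>.wav' file name."""
    stem = name[:-4] if name.endswith(".wav") else name
    digit, rest = stem.split("_", 1)
    speaker, index = rest.rsplit("_", 1)
    return int(digit), speaker, int(index)


def wav_info(source) -> Tuple[int, float]:
    """Return (sample rate in Hz, duration in seconds) for a WAV path or file object."""
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        return rate, wav.getnframes() / rate


# Self-contained demo: half a second of 8 kHz, 16-bit mono silence held in memory.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)
    out.setframerate(8000)
    out.writeframes(b"\x00\x00" * 4000)
buffer.seek(0)

print(parse_fsdd_name("7_jackson_32.wav"))
print(wav_info(buffer))
```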
6. Stanford Question Answering Dataset (SQuAD)
About:
The Stanford Question Answering Dataset (SQuAD) serves as a reading comprehension dataset, containing questions posed by crowd-workers based on a set of Wikipedia articles. Each question in the dataset has a corresponding reading passage, and the answer to the question is a segment of text (span) from the passage. SQuAD2.0, an enhanced version, incorporates 100,000 questions from SQuAD1.1 and over 50,000 new unanswerable questions designed to resemble answerable ones.
Category: Question Answering
Access the Dataset: https://rajpurkar.github.io/SQuAD-explorer/
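To make the "answer is a span of the passage" idea concrete, here is a small sketch that walks a SQuAD-style JSON structure and checks that each answer really sits at its stated character offset. The sample record is made up, and the field names (`paragraphs`, `qas`, `answer_start`, `is_impossible`) are assumed to match the published SQuAD 2.0 files.

```python
import json

# Hypothetical two-question sample in the assumed SQuAD 2.0 JSON layout.
SAMPLE = json.loads("""
{
  "version": "v2.0",
  "data": [{
    "title": "Example",
    "paragraphs": [{
      "context": "LibriVox audiobooks are read by volunteers.",
      "qas": [
        {"id": "q1", "question": "Who reads LibriVox audiobooks?",
         "is_impossible": false,
         "answers": [{"text": "volunteers", "answer_start": 32}]},
        {"id": "q2", "question": "How many audiobooks are there?",
         "is_impossible": true, "answers": []}
      ]
    }]
  }]
}
""")


def iter_answers(squad: dict):
    """Yield (question id, answer text or None) and verify each span offset."""
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                if qa.get("is_impossible", False):
                    yield qa["id"], None  # unanswerable question
                    continue
                answer = qa["answers"][0]
                start = answer["answer_start"]
                span = context[start:start + len(answer["text"])]
                if span != answer["text"]:
                    raise ValueError(f"span mismatch for {qa['id']}")
                yield qa["id"], span


print(list(iter_answers(SAMPLE)))
```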
7. Jeopardy! Questions in a JSON File
About:
This dataset is a JSON file that includes 216,930 Jeopardy questions, answers, and additional data. Over the course of the show’s history, the total number of Jeopardy! questions has reached 252,583. The dataset has an approximate file size of 53 MB.
Category: Question Answering
Access the Dataset: https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/
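Once the JSON file is loaded, a typical first step is to turn the dollar values into numbers. The sketch below does this for two made-up records; the field names (`category`, `air_date`, `round`, `show_number`, `value`, `question`, `answer`) are assumptions about the file's layout and should be checked against the download.

```python
import json
from typing import Optional

# Hypothetical records; field names are assumed, the content is invented.
SAMPLE = json.loads("""
[
  {"category": "SCIENCE", "air_date": "2004-12-31", "round": "Jeopardy!",
   "show_number": "4680", "value": "$2,000",
   "question": "'A sample clue'", "answer": "a sample response"},
  {"category": "HISTORY", "air_date": "2004-12-31", "round": "Final Jeopardy!",
   "show_number": "4680", "value": null,
   "question": "'Another sample clue'", "answer": "another response"}
]
""")


def parse_value(value: Optional[str]) -> Optional[int]:
    """Turn a string such as '$2,000' into 2000; keep missing values as None."""
    if value is None:
        return None
    return int(value.replace("$", "").replace(",", ""))


values = [parse_value(row["value"]) for row in SAMPLE]
print(values)
```

For the real file you would replace `SAMPLE` with the result of `json.load` on the downloaded file.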
8. Yelp Reviews
About:
The Yelp dataset serves as a comprehensive collection of Yelp’s businesses, reviews, and user data. This versatile dataset is suitable for personal, educational, and academic purposes. It includes a vast amount of data, comprising 6,685,900 reviews, 200,000 pictures, and 192,609 businesses across ten metropolitan areas.
Category: Text Classification
Access the Dataset: https://www.yelp.com/dataset
9. WordNet
About:
WordNet is an extensive lexical database of English that organizes words by meaning. Like a thesaurus, WordNet groups words by synonymy, such as “shut” and “close” or “car” and “automobile.” Each group of synonyms is called a synset; the database contains about 117,000 synsets, each linked to other synsets through a small number of conceptual relations.
Category: Text Classification
Access the Dataset: https://wordnet.princeton.edu/
10. TIMIT
About:
The TIMIT Acoustic-Phonetic Continuous Speech Corpus is specifically designed to provide speech data for acoustic-phonetic studies and the development and evaluation of automatic speech recognition systems. This dataset includes broadband recordings of 630 speakers representing eight major dialects of American English. Each speaker reads ten phonetically rich sentences.
Category: Speech Recognition
Access the Dataset: https://catalog.ldc.upenn.edu/LDC93S1
11. IMDB Movie Review Dataset
About:
The IMDB Movie Review Dataset comprises a collection of movie reviews extracted from the IMDB website. The dataset includes labeled sentiment analysis data, with reviews classified as positive or negative. It serves as an excellent resource for training sentiment analysis models.
Category: Sentiment Analysis
Access the Dataset: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
12. Twitter Sentiment Analysis Dataset
About:
The Twitter Sentiment Analysis Dataset is a collection of tweets extracted from the Twitter platform. The dataset contains labeled sentiment analysis data, with tweets categorized as positive, negative, or neutral. This dataset is ideal for training sentiment analysis models with a focus on social media sentiment.
Category: Sentiment Analysis
Access the Dataset: https://www.kaggle.com/datasets/kazanova/sentiment140
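Reading the labeled tweets can be sketched with the standard `csv` module. The column order, the absence of a header row, and the numeric polarity codes (0, 2, 4) are assumptions about the linked Sentiment140 file, and the two rows below are invented stand-ins for real data.

```python
import csv
import io

# Assumed column order; the file is assumed to ship without a header row.
COLUMNS = ["target", "id", "date", "flag", "user", "text"]
LABELS = {"0": "negative", "2": "neutral", "4": "positive"}

# Two made-up rows standing in for the real file.
SAMPLE = (
    '"0","1","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","user_a","missed the bus again"\n'
    '"4","2","Mon Apr 06 22:20:01 PDT 2009","NO_QUERY","user_b","great coffee this morning"\n'
)


def read_tweets(handle):
    """Yield (label, text) pairs from a Sentiment140-style CSV stream."""
    reader = csv.DictReader(handle, fieldnames=COLUMNS)
    for row in reader:
        yield LABELS[row["target"]], row["text"]


tweets = list(read_tweets(io.StringIO(SAMPLE)))
print(tweets)
```

With the downloaded file you would pass an opened file handle instead of the `StringIO` object; you may need to set an explicit encoding if the default fails.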
13. Gutenberg eBooks Corpus
About:
The Gutenberg eBooks Corpus is a vast collection of freely available eBooks from Project Gutenberg. It comprises a wide range of literary works, including novels, plays, poems, and non-fiction texts, and is useful for NLP tasks such as text classification and language modeling. Note that the link below points to a poetry-focused subset hosted on Hugging Face rather than the full collection.
Category: Text Classification
Access the Dataset: https://huggingface.co/datasets/biglam/gutenberg-poetry-corpus
14. Reuters News Corpus
About:
The Reuters News Corpus is a comprehensive dataset consisting of news articles from the Reuters news agency. This dataset covers a wide range of topics, including finance, politics, sports, and technology. With a large collection of articles, it is suitable for various NLP tasks such as text classification, topic modeling, and sentiment analysis.
Category: Text Classification
Access the Dataset: https://paperswithcode.com/dataset/reuters-21578
15. WikiText-103
About:
The WikiText-103 dataset is a large-scale language modeling dataset derived from Wikipedia articles. It comprises over 100 million tokens and is widely used for training and evaluating language models. This dataset provides a rich and diverse range of text, making it valuable for language modeling and text generation tasks.
Category: Language Modeling
Access the Dataset: https://developer.ibm.com/exchanges/data/all/wikitext-103/
16. MovieLens Dataset
About:
The MovieLens dataset is a popular dataset for recommender systems and movie-related analysis. It includes movie ratings, user information, and movie metadata. With millions of ratings and diverse movie genres, this dataset is suitable for building recommendation engines and conducting movie-related research.
Category: Recommender Systems
Access the Dataset: https://grouplens.org/datasets/movielens/
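As a rough starting point for a recommender baseline, the sketch below computes the mean rating per movie. The `userId,movieId,rating,timestamp` header is an assumption about the `ratings.csv` file in recent MovieLens releases, and the three rows are invented.

```python
import csv
import io
from collections import defaultdict

# Made-up rows in the assumed ratings.csv layout.
SAMPLE = """userId,movieId,rating,timestamp
1,10,4.0,964982703
2,10,5.0,964981247
1,20,3.0,964982224
"""


def mean_ratings(handle):
    """Return {movieId: mean rating} for a MovieLens-style ratings stream."""
    totals = defaultdict(lambda: [0.0, 0])  # movieId -> [sum of ratings, count]
    for row in csv.DictReader(handle):
        entry = totals[row["movieId"]]
        entry[0] += float(row["rating"])
        entry[1] += 1
    return {movie: total / count for movie, (total, count) in totals.items()}


averages = mean_ratings(io.StringIO(SAMPLE))
print(averages)
```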
17. CoNLL 2003
About:
The CoNLL 2003 dataset is widely used for named entity recognition (NER) tasks. It consists of news articles from the Reuters Corpus labeled with named entity annotations for entities like persons, organizations, and locations. This dataset is essential for developing and evaluating NER models.
Category: Named Entity Recognition
Access the Dataset: https://huggingface.co/datasets/conll2003
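If you work with the column-formatted text files rather than the Hugging Face loader, a reader can be sketched as below. It assumes one token per line with the NER tag in the last column, blank lines between sentences, and `-DOCSTART-` marker lines between documents; the sample lines are illustrative and the exact tag scheme may differ between distributions.

```python
from typing import Iterable, Iterator, List, Tuple

# Illustrative lines in the assumed "token POS chunk NER" column layout.
SAMPLE = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
"""


def read_conll(lines: Iterable[str]) -> Iterator[List[Tuple[str, str]]]:
    """Yield sentences as lists of (token, NER tag) pairs."""
    sentence: List[Tuple[str, str]] = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("-DOCSTART-"):
            # A blank line or document marker closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        columns = line.split()
        sentence.append((columns[0], columns[-1]))
    if sentence:
        yield sentence


sentences = list(read_conll(SAMPLE.splitlines()))
print(sentences)
```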
18. EuroParl
About:
The EuroParl dataset is a collection of parallel texts from the proceedings of the European Parliament. It contains translated versions of parliamentary speeches in multiple languages. This dataset is valuable for tasks such as machine translation, cross-lingual information retrieval, and multilingual NLP.
Category: Machine Translation
Access the Dataset: https://www.statmt.org/europarl/
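For machine translation, parallel text is usually consumed as sentence pairs. The sketch below assumes the sentence-aligned release stores one sentence per line in two files, so that line i of one file translates line i of the other; the in-memory strings are placeholders for those files.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_parallel(src: TextIO, tgt: TextIO) -> Iterator[Tuple[str, str]]:
    """Pair line i of the source stream with line i of the target stream."""
    for src_line, tgt_line in zip(src, tgt):
        src_line, tgt_line = src_line.strip(), tgt_line.strip()
        if src_line and tgt_line:  # skip pairs where either side is empty
            yield src_line, tgt_line


# Placeholder content; the second English line is deliberately empty.
english = io.StringIO("Resumption of the session\n\nThank you.\n")
german = io.StringIO("Wiederaufnahme der Sitzungsperiode\nLeer\nVielen Dank.\n")
pairs = list(read_parallel(english, german))
print(pairs)
```

Dropping pairs with an empty side is a design choice of this sketch; some pipelines keep them to preserve alignment indices.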
19. Cornell Movie Dialogs Corpus
About:
The Cornell Movie Dialogs Corpus is a dataset of movie conversations containing a wide range of dialogues between characters. It includes metadata such as movie titles, genres, and dialogue lines. This dataset is often used for tasks like dialogue generation, sentiment analysis, and language understanding.
Category: Dialogue Systems
Access the Dataset: https://huggingface.co/datasets/cornell_movie_dialog
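If you download the raw corpus files instead of using a loader, each dialogue line has to be split into its fields. The sketch below assumes the ` +++$+++ ` separator and the five-field layout of `movie_lines.txt`; both the layout and the example line should be checked against your copy.

```python
from typing import NamedTuple

SEPARATOR = " +++$+++ "  # assumed field separator in movie_lines.txt


class MovieLine(NamedTuple):
    line_id: str
    character_id: str
    movie_id: str
    character_name: str
    text: str


def parse_movie_line(raw: str) -> MovieLine:
    """Split one raw line into its five assumed fields."""
    # maxsplit=4 keeps any separator-like text inside the utterance intact.
    fields = raw.rstrip("\n").split(SEPARATOR, 4)
    if len(fields) != 5:
        raise ValueError(f"unexpected line format: {raw!r}")
    return MovieLine(*fields)


# Illustrative line in the assumed layout.
line = parse_movie_line("L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!\n")
print(line.character_name, "->", line.text)
```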
20. OpenSubtitles
About:
The OpenSubtitles dataset is a large collection of movie and TV show subtitles in various languages. It provides a wealth of text data suitable for tasks like machine translation, dialogue systems, and language modeling. With subtitles from diverse genres and languages, it offers a valuable resource for multilingual NLP.
Category: Multilingual NLP
Access the Dataset: https://autonlp.ai/datasets/opensubtitles
With these 20 free and open-source NLP datasets, you can give your NLP project a foundation of high-quality data. They cover a wide range of categories and tasks, including sentiment analysis, speech recognition, question answering, text classification, recommender systems, and more. Remember to attribute each dataset properly and to comply with its license and terms of use.