Text Classification Made Easy: Explore These 10 Open-Source Datasets

Text classification is the task of assigning predefined categories to text, such as labeling a review as positive or negative or flagging a message as spam. According to industry sources, the global text analytics market was projected to grow at a compound annual growth rate (CAGR) of over 20% between 2020 and 2024. Text classification is applied in many domains, including automating CRM tasks, improving web browsing, and supporting e-commerce.

Contents

1. Amazon Reviews Dataset

2. Enron Email Dataset

3. Goodreads Book Reviews

4. IMDB Dataset

5. MovieLens Latest Datasets

6. OpinRank Dataset

7. SMS Spam Collection

8. The Blog Authorship Corpus

9. WordNet

10. Yelp Reviews

In this article, we present a curated list of ten open-source datasets that can be used for text classification. They cover a wide range of domains and are useful for training and evaluating machine learning models. Here are the details of each dataset:

1. Amazon Reviews Dataset

The Amazon Reviews dataset comprises millions of customer reviews and star ratings from the Amazon platform. It is well suited to training sentiment analysis models, for example with fastText. The download is around 493MB, a substantial amount of data for analysis and modeling.
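
Since the dataset is pitched at fastText, it may help to see what a fastText-style training line looks like. The sketch below is a minimal parser, assuming each line starts with one or more labels carrying fastText's default `__label__` prefix, followed by the review text. The sample lines and the meaning of labels `1` and `2` (negative and positive) are illustrative assumptions, not taken from the dataset.

```python
from typing import List, Tuple

LABEL_PREFIX = "__label__"  # fastText's default label prefix


def parse_fasttext_line(line: str) -> Tuple[List[str], str]:
    """Split one fastText-style line into (labels, text)."""
    labels: List[str] = []
    tokens = line.strip().split(" ")
    index = 0
    # Labels come first; each one is a token that starts with the prefix.
    while index < len(tokens) and tokens[index].startswith(LABEL_PREFIX):
        labels.append(tokens[index][len(LABEL_PREFIX):])
        index += 1
    # Everything after the labels is the review text.
    text = " ".join(tokens[index:])
    return labels, text


# Hypothetical sample lines, assuming 1 = negative and 2 = positive.
sample = [
    "__label__2 Great sound quality and fast shipping",
    "__label__1 Stopped working after two days",
]
parsed = [parse_fasttext_line(line) for line in sample]
for labels, text in parsed:
    print(labels, text)
```

Parsing the file this way makes it easy to check the class balance or to convert the data for another library before training.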

Access the dataset here.

2. Enron Email Dataset

The Enron Email Dataset contains emails from nearly 150 users, mostly senior management at Enron. Collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes), it holds approximately 500,000 messages. Analyzing this dataset can provide valuable insights into email communication patterns and behaviors.
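
As a rough sketch of how such messages might be prepared for classification, the snippet below uses Python's standard `email` module to pull the sender, recipient, subject, and body out of a raw message. It assumes each message is stored as plain RFC 822-style text; the message shown is a made-up example, not one from the corpus, and `extract_fields` is a hypothetical helper.

```python
from email import message_from_string

# Hypothetical raw message, written for illustration only.
raw_message = (
    "Message-ID: <example-001@example.com>\n"
    "Date: Mon, 14 May 2001 16:39:00 -0700\n"
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Quarterly forecast\n"
    "\n"
    "Please review the attached numbers before Friday.\n"
)


def extract_fields(raw: str) -> dict:
    """Parse a raw email and return the fields most useful as features."""
    message = message_from_string(raw)
    return {
        "sender": message["From"],
        "recipient": message["To"],
        "subject": message["Subject"],
        # For a non-multipart message, get_payload() returns the body text.
        "body": message.get_payload().strip(),
    }


fields = extract_fields(raw_message)
print(fields["subject"], "|", fields["body"])
```

The subject and body can then feed a text classifier, while the sender and recipient support analysis of who communicates with whom.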

Access the dataset here.

3. Goodreads Book Reviews

The Goodreads Book Reviews dataset offers a vast collection of reviews sourced from the popular Goodreads book review website. This dataset includes various attributes describing the items, such as reviews, reading status, review actions, book attributes, and more. With over 1.5 million items, this dataset enables researchers to explore text classification in the realm of book reviews.

Access the dataset here.

4. IMDB Dataset

The IMDB dataset features 50,000 movie reviews suitable for natural language processing (NLP) and text analytics tasks. Designed for binary sentiment classification, this dataset contains 25,000 positive and 25,000 negative movie reviews. It serves as a benchmark dataset for sentiment analysis and allows researchers to develop robust text classification models.
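
To make the binary sentiment task concrete, here is a minimal sketch of a bag-of-words Naive Bayes classifier with add-one smoothing, written with only the standard library. It is a simple baseline chosen for illustration, not a method prescribed by the dataset, and the four training reviews are invented stand-ins for the real data.

```python
import math
from collections import Counter
from typing import Dict, List, Set, Tuple

Model = Tuple[Dict[str, Counter], Counter, Set[str]]


def tokenize(text: str) -> List[str]:
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def train(examples: List[Tuple[str, str]]) -> Model:
    """Count words per class, documents per class, and the vocabulary."""
    word_counts: Dict[str, Counter] = {}
    class_counts: Counter = Counter()
    vocabulary: Set[str] = set()
    for text, label in examples:
        class_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for token in tokenize(text):
            counts[token] += 1
            vocabulary.add(token)
    return word_counts, class_counts, vocabulary


def predict(text: str, model: Model) -> str:
    """Return the class with the highest log-probability for the text."""
    word_counts, class_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = "", -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            likelihood = (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy reviews standing in for the real training data.
toy_reviews = [
    ("a wonderful and moving film", "positive"),
    ("brilliant acting and a great story", "positive"),
    ("a dull and boring plot", "negative"),
    ("terrible acting and a weak script", "negative"),
]
model = train(toy_reviews)
print(predict("a great and moving story", model))
print(predict("boring and weak script", model))
```

Because the classes are evenly balanced, a classifier that always guesses one label scores 50% accuracy, which gives a clear floor for comparing models.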

Access the dataset here.

5. MovieLens Latest Datasets

The MovieLens Latest Datasets contain movie ratings, tag applications, and user information. They come in two subsets, each gathered over a specific period. The small set contains 100,000 ratings and 3,600 tag applications applied to 9,000 movies by 600 users. The large set contains 27 million ratings and 1.1 million tag applications applied to 58,000 movies by 280,000 users. The large set also includes tag genome data with 14 million relevance scores across 1,100 tags.

Access the dataset here.

6. OpinRank Dataset

The OpinRank Dataset is a collection of hotel and car reviews sourced from Tripadvisor and Edmunds. It covers hotel reviews across ten cities and car reviews for model years 2007, 2008, and 2009. With roughly 42,230 car reviews and 259,000 hotel reviews, it is a valuable resource for text classification tasks.

Access the dataset here.

7. SMS Spam Collection

The SMS Spam Collection is a publicly available dataset built for mobile phone spam research. It consists of 5,574 English messages, each tagged as either legitimate (ham) or spam. This makes it a convenient starting point for developing spam detection models with natural language processing techniques. The dataset is available in both plain text and ARFF format.
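
A minimal sketch of reading the plain-text version follows. It assumes each line holds a label (`ham` for legitimate messages, or `spam`) and the message text separated by a tab; the sample lines are invented for illustration, and `parse_sms_lines` is a hypothetical helper.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def parse_sms_lines(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Turn 'label<TAB>message' lines into (label, message) pairs."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the first tab only, so tabs inside a message survive.
        label, _, message = line.partition("\t")
        records.append((label, message))
    return records


# Invented sample lines in the assumed format.
sample_lines = [
    "ham\tAre we still meeting for lunch today?\n",
    "spam\tYou have won a prize! Reply now to claim.\n",
    "ham\tRunning late, see you in ten minutes.\n",
]
records = parse_sms_lines(sample_lines)
label_counts = Counter(label for label, _ in records)
print(label_counts)
```

Checking the label counts first is worthwhile: when one class dominates, accuracy alone can be misleading, and precision and recall on the spam class are more informative.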

Access the dataset here.

8. The Blog Authorship Corpus

The Blog Authorship Corpus collects the posts of 19,320 bloggers, gathered from blogger.com in August 2004. The corpus contains 681,288 posts and over 140 million words, an average of 35 posts and 7,250 words per blogger. Each blog is stored as a separate file whose name gives the blogger’s ID and self-provided gender, age, industry, and astrological sign.
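
The metadata fields listed above can serve directly as classification labels, for example to predict age or gender from writing style. The sketch below assumes they are encoded in the file name as dot-separated fields in the order ID, gender, age, industry, sign, followed by an extension. That layout and the example file name are assumptions for illustration, so check them against the downloaded files.

```python
from typing import NamedTuple


class BloggerInfo(NamedTuple):
    blogger_id: str
    gender: str
    age: int
    industry: str
    sign: str


def parse_blog_filename(filename: str) -> BloggerInfo:
    """Extract blogger metadata from an 'id.gender.age.industry.sign.ext' name."""
    stem = filename.rsplit(".", 1)[0]  # drop the file extension
    blogger_id, gender, age, industry, sign = stem.split(".")
    return BloggerInfo(blogger_id, gender, int(age), industry, sign)


# Hypothetical file name in the assumed layout.
info = parse_blog_filename("1234567.female.27.Education.Libra.xml")
print(info)
```

A file name that does not have exactly five fields raises a `ValueError`, which surfaces unexpected names early instead of silently mislabeling posts.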

Access the dataset here.

9. WordNet

WordNet is a comprehensive lexical database of the English language, grouping nouns, verbs, adjectives, and adverbs into sets called synsets. Each synset represents a distinct concept and is connected to other synsets through a limited number of conceptual relations. This dataset consists of approximately 117,000 synsets, providing researchers with a valuable resource for semantic analysis and text classification tasks.

Access the dataset here.

10. Yelp Reviews

The Yelp dataset is a general-purpose learning resource that offers a subset of Yelp’s businesses, reviews, and user data. It comprises 6,685,900 reviews, 200,000 pictures, and 192,609 businesses from ten metropolitan areas. It can be used for personal, educational, and academic projects.

Access the dataset here.