Datasets for single-label text categorization

This page makes available some files containing the terms I obtained by pre-processing some well-known datasets used for text categorization.

I did not create the datasets. I am simply making available already processed versions of them, for three main reasons:

I make them available here on the same terms as they were originally available, which is basically for research purposes. If you want to use them for any other purpose, please ask for permission from the original creator. You can reach their homepages by following the links next to each one of them.

20 Newsgroups

I downloaded the 20Newsgroups dataset from Jason Rennie's page and used the "bydate" version, because it already had a standard train/test split. This dataset is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.

Although already cleaned-up, this dataset still had several attachments, many PGP keys and some duplicates.

After removing them, along with the messages that became empty as a result, the distribution of train and test messages for each newsgroup was the following:

20 Newsgroups
Class                     # train docs  # test docs  Total # docs
alt.atheism               480           319          799
comp.graphics             584           389          973
comp.os.ms-windows.misc   572           394          966
comp.sys.ibm.pc.hardware  590           392          982
comp.sys.mac.hardware     578           385          963
comp.windows.x            593           392          985
misc.forsale              585           390          975
rec.autos                 594           395          989
rec.motorcycles           598           398          996
rec.sport.baseball        597           397          994
rec.sport.hockey          600           399          999
sci.crypt                 595           396          991
sci.electronics           591           393          984
sci.med                   594           396          990
sci.space                 593           394          987
soc.religion.christian    598           398          996
talk.politics.guns        545           364          909
talk.politics.mideast     564           376          940
talk.politics.misc        465           310          775
talk.religion.misc        377           251          628
Total                     11293         7528         18821

Reuters 21578

I downloaded the Reuters-21578 dataset from David Lewis' page and used the standard "ModApté" train/test split. These documents appeared on the Reuters newswire in 1987 and were manually classified by personnel from Reuters Ltd.

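Because the class distribution for these documents is very skewed, two sub-collections are usually considered for text categorization tasks (see this paper): R10, the set of the 10 classes with the highest number of positive training examples, and R90, the set of the 90 classes with at least one positive training and one positive test example.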

Moreover, many of these documents are classified as having no topic at all or as having more than one topic. The following table shows the distribution of documents per number of topics, where # train docs and # test docs refer to the ModApté split and # other refers to documents that were not considered in this split:

Reuters 21578
# Topics  # train docs  # test docs  # other  Total # docs
0         1828          280          8103     10211
1         6552          2581         361      9494
2         890           309          135      1334
3         191           64           55       310
4         62            32           10       104
5         39            14           8        61
6         21            6            3        30
7         7             4            0        11
8         4             2            0        6
9         4             2            0        6
10        3             1            0        4
11        0             1            1        2
12        1             1            0        2
13        0             0            0        0
14        0             2            0        2
15        0             0            0        0
16        1             0            0        1

Because the goal of this page is to consider single-label datasets, all documents with no topic or with more than one topic were eliminated. As a result, some of the classes in R10 and R90 were left with no train or test documents.

Considering only the documents with a single topic and the classes which still have at least one train and one test example, we have 8 of the 10 most frequent classes and 52 of the original 90.
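The filtering described above can be sketched as follows. This is only an illustration under assumptions: the record layout `(doc_id, split, topics)` and the function name `single_label_subset` are hypothetical and do not come from the Reuters distribution or from the scripts used to build these files.

```python
from typing import List, Tuple

# Hypothetical record: (document id, "train" / "test" / "other", list of topics)
Record = Tuple[int, str, List[str]]


def single_label_subset(records: List[Record]) -> List[Tuple[int, str, str]]:
    """Keep single-topic train/test documents whose class occurs in both splits."""
    # Step 1: keep only documents in the split that have exactly one topic.
    single = [
        (doc_id, split, topics[0])
        for doc_id, split, topics in records
        if split in ("train", "test") and len(topics) == 1
    ]
    # Step 2: keep only classes with at least one train and one test example.
    train_classes = {label for _, split, label in single if split == "train"}
    test_classes = {label for _, split, label in single if split == "test"}
    kept_classes = train_classes & test_classes
    return [item for item in single if item[2] in kept_classes]


toy_records = [
    (1, "train", ["earn"]),
    (2, "test", ["earn"]),
    (3, "train", ["grain", "wheat"]),  # more than one topic: dropped
    (4, "test", []),                   # no topic: dropped
    (5, "train", ["corn"]),            # class has no test document: dropped
    (6, "other", ["earn"]),            # not part of the split: dropped
]
kept = single_label_subset(toy_records)
```

On the toy records only documents 1 and 2 survive, which mirrors how classes such as corn and wheat can vanish once multi-topic documents are removed.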

Following Sebastiani's convention, we will call these sets R8 and R52. Note that from R10 to R8 the classes corn and wheat, which are intimately related to the class grain, disappeared, and grain itself lost many of its documents.

The distribution of documents per class is the following for R8 and R52:

R8
Class            # train docs  # test docs  Total # docs
acq              1596          696          2292
crude            253           121          374
earn             2840          1083         3923
grain            41            10           51
interest         190           81           271
money-fx         206           87           293
ship             108           36           144
trade            251           75           326
Total            5485          2189         7674

R52
Class            # train docs  # test docs  Total # docs
acq              1596          696          2292
alum             31            19           50
bop              22            9            31
carcass          6             5            11
cocoa            46            15           61
coffee           90            22           112
copper           31            13           44
cotton           15            9            24
cpi              54            17           71
cpu              3             1            4
crude            253           121          374
dlr              3             3            6
earn             2840          1083         3923
fuel             4             7            11
gas              10            8            18
gnp              58            15           73
gold             70            20           90
grain            41            10           51
heat             6             4            10
housing          15            2            17
income           7             4            11
instal-debt      5             1            6
interest         190           81           271
ipi              33            11           44
iron-steel       26            12           38
jet              2             1            3
jobs             37            12           49
lead             4             4            8
lei              11            3            14
livestock        13            5            18
lumber           7             4            11
meal-feed        6             1            7
money-fx         206           87           293
money-supply     123           28           151
nat-gas          24            12           36
nickel           3             1            4
orange           13            9            22
pet-chem         13            6            19
platinum         1             2            3
potato           2             3            5
reserves         37            12           49
retail           19            1            20
rubber           31            9            40
ship             108           36           144
strategic-metal  9             6            15
sugar            97            25           122
tea              2             3            5
tin              17            10           27
trade            251           75           326
veg-oil          19            11           30
wpi              14            9            23
zinc             8             5            13
Total            6532          2568         9100

Cade

The documents in Cade12 correspond to a subset of web pages extracted from the CADÊ Web Directory, which points to Brazilian web pages classified by human experts. This directory is available at Cade's Homepage, in Brazilian Portuguese.

A pre-processed version of this dataset was made available to me by Marco Cristo, from Universidade Federal de Minas Gerais, in Brazil. This dataset is part of project Gerindo.

Because there is no standard train/test split for this dataset, and in order to be consistent with the previous ones, I randomly chose two thirds of the documents for training and the remaining third for testing.
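A random two-thirds/one-third split of this kind can be sketched as below. The function name `split_two_thirds`, the fixed seed, and the choice to round the training size down are all assumptions made for illustration; the page does not say which random generator or seed was used, nor whether the split was done per class.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_two_thirds(documents: Sequence[T], seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle a copy of the documents and cut it at two thirds (rounded down)."""
    shuffled = list(documents)
    random.Random(seed).shuffle(shuffled)  # local generator, so the split is reproducible
    cut = (2 * len(shuffled)) // 3
    return shuffled[:cut], shuffled[cut:]


train_docs, test_docs = split_two_thirds(range(9))
```

Because the shuffle depends on the seed, a split produced this way will not reproduce the exact train and test files offered on this page; use the provided files when comparing results.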

For this particular split, the distribution of documents per class is the following:

Cade12
Class               # train docs  # test docs  Total # docs
01--servicos        5627          2846         8473
02--sociedade       4935          2428         7363
03--lazer           3698          1892         5590
04--informatica     2983          1536         4519
05--saude           2118          1053         3171
06--educacao        1912          944          2856
07--internet        1585          796          2381
08--cultura         1494          643          2137
09--esportes        1277          630          1907
10--noticias        701           381          1082
11--ciencias        569           310          879
12--compras-online  423           202          625
Total               27322         13661        40983

WebKB

The documents in the WebKB are webpages collected by the World Wide Knowledge Base (Web->Kb) project of the CMU text learning group, and were downloaded from The 4 Universities Data Set Homepage. These pages were collected from computer science departments of various universities in 1997, manually classified into seven different classes: student, faculty, staff, department, course, project, and other.

The class other is a collection of pages that were not deemed the "main page" representing an instance of the previous six classes. For example, a particular faculty member may be represented by a home page, a publications list, a vita and several research interests pages. Only the faculty member's home page was placed in the faculty class. The publications list, vita and research interests pages were all placed in the other category.

For each class, the collection contains pages from four universities: Cornell, Texas, Washington, Wisconsin, and other miscellaneous pages collected from other universities.

I discarded the classes department and staff because there were only a few pages from each university. I also discarded the class other because the pages within it were very different from one another.

Because there is no standard train/test split for this dataset, and in order to be consistent with the previous ones, I randomly chose two thirds of the documents for training and the remaining third for testing.

For this particular split, the distribution of documents per class is the following:

WebKB
Class    # train docs  # test docs  Total # docs
project  336           168          504
course   620           310          930
faculty  750           374          1124
student  1097          544          1641
Total    2803          1396         4199

The files

From here, you can download the files.

20 Newsgroups
             Train                            Test
# documents  11293 docs                       7528 docs
all-terms    20ng-train-all-terms (15.91 Mb)  20ng-test-all-terms (10.31 Mb)
no-short     20ng-train-no-short (14.06 Mb)   20ng-test-no-short (9.12 Mb)
no-stop      20ng-train-no-stop (10.59 Mb)    20ng-test-no-stop (6.86 Mb)
stemmed      20ng-train-stemmed (9.46 Mb)     20ng-test-stemmed (6.13 Mb)

Reuters-21578 R8
             Train                         Test
# documents  5485 docs                     2189 docs
all-terms    r8-train-all-terms (3.20 Mb)  r8-test-all-terms (1.14 Mb)
no-short     r8-train-no-short (2.90 Mb)   r8-test-no-short (1.03 Mb)
no-stop      r8-train-no-stop (2.42 Mb)    r8-test-no-stop (0.86 Mb)
stemmed      r8-train-stemmed (2.13 Mb)    r8-test-stemmed (0.76 Mb)

Reuters-21578 R52
             Train                          Test
# documents  6532 docs                      2568 docs
all-terms    r52-train-all-terms (4.08 Mb)  r52-test-all-terms (1.45 Mb)
no-short     r52-train-no-short (3.71 Mb)   r52-test-no-short (1.32 Mb)
no-stop      r52-train-no-stop (3.08 Mb)    r52-test-no-stop (1.09 Mb)
stemmed      r52-train-stemmed (2.71 Mb)    r52-test-stemmed (0.96 Mb)

Cade12
             Train                          Test
# documents  27322 docs                     13661 docs
stemmed      cade-train-stemmed (24.50 Mb)  cade-test-stemmed (11.65 Mb)

WebKB
             Train                          Test
# documents  2803 docs                      1396 docs
stemmed      webkb-train-stemmed (2.40 Mb)  webkb-test-stemmed (1.20 Mb)

All the files mentioned above are also available here in one zip file (48 Mb).

File description

All of these are text files containing one document per line.

Each document is composed of its class and its terms.

Each document is represented by a "word" representing the document's class, a TAB character and then a sequence of "words" delimited by spaces, representing the terms contained in the document.
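A minimal sketch of reading files in this format is shown below. The names `parse_documents` and `load_documents` are hypothetical, and the UTF-8 encoding is an assumption, since the page does not state one.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Document = Tuple[str, List[str]]  # (class label, list of terms)


def parse_documents(lines: Iterable[str]) -> List[Document]:
    """Parse lines of the form '<class><TAB><term> <term> ...'."""
    documents = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # tolerate blank lines, e.g. at the end of a file
        label, _, text = line.partition("\t")  # split at the first TAB only
        documents.append((label, text.split()))
    return documents


def load_documents(path: str, encoding: str = "utf-8") -> List[Document]:
    # The encoding is an assumption; adjust it if the files fail to decode.
    with open(path, encoding=encoding) as handle:
        return parse_documents(handle)


# Two made-up lines in the described format, used only to show the result shape.
sample_lines = ["earn\tprofit rose quarter\n", "acq\tcompany agreed buy shares\n"]
docs = parse_documents(sample_lines)
class_counts = Counter(label for label, _ in docs)
```

With a downloaded file, `load_documents("r8-train-stemmed")` would return one `(label, terms)` pair per line, assuming the file is saved under that name in the working directory.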

Pre-processing

Except for the Cade12 dataset, I obtained the present files by applying the following pre-processing to the original datasets:

  1. all-terms Obtained from the original datasets by applying the following transformations:
    1. Substitute TAB, NEWLINE and RETURN characters by SPACE.
    2. Keep only letters (that is, turn punctuation, numbers, etc. into SPACES).
    3. Turn all letters to lowercase.
    4. Substitute multiple SPACES by a single SPACE.
    5. The title/subject of each document is simply added in the beginning of the document's text.
  2. no-short Obtained from the previous file, by removing words that are less than 3 characters long. For example, removing "he" but keeping "him".
  3. no-stop Obtained from the previous file, by removing the 524 SMART stopwords. Some of them had already been removed, because they were shorter than 3 characters.
  4. stemmed Obtained from the previous file, by applying Porter's Stemmer to the remaining words. Information about stemming can be found here.
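The first three steps above can be sketched as follows. This is an approximation under stated assumptions, not the script used to produce the files: "letters" is taken to mean ASCII letters, leading and trailing spaces are trimmed, and `EXAMPLE_STOPWORDS` is a tiny made-up list standing in for the 524 SMART stopwords. The stemming step is left out, since it needs a Porter stemmer implementation.

```python
import re
from typing import AbstractSet

# Tiny illustrative stop list; NOT the 524-word SMART list used for the files.
EXAMPLE_STOPWORDS = frozenset({"the", "and", "him", "was", "for"})


def all_terms(title: str, body: str) -> str:
    """Apply the all-terms transformations to one document."""
    text = f"{title} {body}"                # the title goes in front of the text
    text = re.sub(r"[^A-Za-z]", " ", text)  # TAB, NEWLINE, punctuation, digits -> SPACE
    text = text.lower()                     # all letters to lowercase
    return " ".join(text.split())           # collapse runs of SPACES (also trims the ends)


def no_short(text: str, min_length: int = 3) -> str:
    """Remove words that are fewer than min_length characters long."""
    return " ".join(word for word in text.split() if len(word) >= min_length)


def no_stop(text: str, stopwords: AbstractSet[str] = EXAMPLE_STOPWORDS) -> str:
    """Remove the words that appear in the stop list."""
    return " ".join(word for word in text.split() if word not in stopwords)


step1 = all_terms("Re: Test", "He saw\tHIM, twice!\n")
step2 = no_short(step1)
step3 = no_stop(step2)
```

Here `step1` is `"re test he saw him twice"`, `step2` drops the two-letter words `re` and `he`, and `step3` additionally drops `him` because it is in the example stop list.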

Some results

Just to give an idea of the relative difficulty of each dataset, I have determined the accuracy that some of the most common classification methods achieve on them. As usual, tf-idf term weighting was used to represent documents as vectors, which were normalized to unit length. The stemmed train and test sets were used for each dataset.
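As an illustration of this representation, the sketch below builds tf-idf vectors and normalizes them to unit length. The exact weighting variant behind the reported results is not stated on this page, so the formula used here (raw term frequency times the natural log of N/df) is an assumption, as are the function names.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one sparse, unit-length tf-idf vector per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for terms in docs for term in set(terms))
    vectors = []
    for terms in docs:
        counts = Counter(terms)
        weights = {
            term: tf * math.log(n_docs / doc_freq[term])  # assumed tf-idf variant
            for term, tf in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:
            weights = {term: w / norm for term, w in weights.items()}
        vectors.append(weights)
    return vectors


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Dot product of two sparse vectors; equals the cosine for unit-length vectors."""
    return sum(w * v.get(term, 0.0) for term, w in u.items())


example_docs = [["gold", "price", "gold"], ["oil", "price"], ["gold", "mine"]]
vectors = tfidf_vectors(example_docs)
similarity = cosine(vectors[0], vectors[2])
```

With unit-length vectors, cosine similarity reduces to a dot product, which keeps similarity-based methods such as kNN or a centroid classifier simple to implement.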

The "dumb classifier" is included as a baseline. It ignores the query and always gives as the predicted class the most frequent class in the training set.
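A sketch of such a baseline is given below, using the R8 class counts listed earlier on this page as input. The function names are hypothetical; only the counts come from the page.

```python
from collections import Counter
from typing import Dict, List


def most_frequent_class(train_labels: List[str]) -> str:
    """Return the most frequent training label (ties fall to the first one seen)."""
    return Counter(train_labels).most_common(1)[0][0]


def baseline_accuracy(train_labels: List[str], test_labels: List[str]) -> float:
    """Accuracy of always predicting the most frequent training class."""
    predicted = most_frequent_class(train_labels)
    hits = sum(1 for label in test_labels if label == predicted)
    return hits / len(test_labels)


def expand(counts: Dict[str, int]) -> List[str]:
    """Turn per-class counts into a flat list of labels."""
    return [label for label, n in counts.items() for _ in range(n)]


# R8 counts as listed in the R8 table on this page.
r8_train = {"acq": 1596, "crude": 253, "earn": 2840, "grain": 41,
            "interest": 190, "money-fx": 206, "ship": 108, "trade": 251}
r8_test = {"acq": 696, "crude": 121, "earn": 1083, "grain": 10,
           "interest": 81, "money-fx": 87, "ship": 36, "trade": 75}

accuracy = baseline_accuracy(expand(r8_train), expand(r8_test))
```

For R8 the most frequent training class is earn, so the baseline gets 1083 of the 2189 test documents right, which rounds to the 0.4947 reported in the accuracy table.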

Accuracy Values
Classification Method      R8      R52     20Ng    Cade12  WebKb
Dumb classifier            0.4947  0.4217  0.0530  0.2083  0.3897
Vector Method              0.7889  0.7687  0.7240  0.4142  0.6447
kNN (k = 10)               0.8524  0.8322  0.7593  0.5120  0.7256
Centroid (Normalized Sum)  0.9356  0.8717  0.7885  0.5148  0.8266
Naive Bayes                0.9607  0.8692  0.8103  0.5727  0.8352
SVM (Linear Kernel)        0.9698  0.9377  0.8284  0.5284  0.8582

Note that, because R8, R52, and WebKB are very skewed, the dumb classifier has a "reasonable" performance for these datasets. It is also worth noting that, while for R8, R52, 20Ng, and WebKB it is possible to find good classifiers, that is, classifiers that achieve a high accuracy, for Cade12 the best result does not reach 58% accuracy, even with some of the best classifiers available.

Last updated April 2007.
