This page makes available some files containing the terms I obtained by pre-processing some well-known datasets used for text categorization.
I did not create the datasets. I am simply making available already processed versions of them, for three main reasons:
I make them available here on the same terms as they were originally available, which is basically for research purposes. If you want to use them for any other purpose, please ask for permission from the original creator. You can reach their homepages by following the links next to each one of them.
I downloaded the 20Newsgroups dataset from Jason Rennie's page and used the "bydate" version, because it already had a standard train/test split. This dataset is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.
Although already cleaned up, this dataset still contained several attachments, many PGP keys and some duplicates.
After removing them, along with the messages that became empty as a result, the distribution of train and test messages for each newsgroup was the following:

20 Newsgroups

| Class | # train docs | # test docs | Total # docs |
|---|---|---|---|
| alt.atheism | 480 | 319 | 799 |
| comp.graphics | 584 | 389 | 973 |
| comp.os.ms-windows.misc | 572 | 394 | 966 |
| comp.sys.ibm.pc.hardware | 590 | 392 | 982 |
| comp.sys.mac.hardware | 578 | 385 | 963 |
| comp.windows.x | 593 | 392 | 985 |
| misc.forsale | 585 | 390 | 975 |
| rec.autos | 594 | 395 | 989 |
| rec.motorcycles | 598 | 398 | 996 |
| rec.sport.baseball | 597 | 397 | 994 |
| rec.sport.hockey | 600 | 399 | 999 |
| sci.crypt | 595 | 396 | 991 |
| sci.electronics | 591 | 393 | 984 |
| sci.med | 594 | 396 | 990 |
| sci.space | 593 | 394 | 987 |
| soc.religion.christian | 598 | 398 | 996 |
| talk.politics.guns | 545 | 364 | 909 |
| talk.politics.mideast | 564 | 376 | 940 |
| talk.politics.misc | 465 | 310 | 775 |
| talk.religion.misc | 377 | 251 | 628 |
| Total | 11293 | 7528 | 18821 |
I downloaded the Reuters-21578 dataset from David Lewis' page and used the standard "ModApté" train/test split. These documents appeared on the Reuters newswire in 1987 and were manually classified by personnel from Reuters Ltd.
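Because the class distribution for these documents is very skewed, two sub-collections are usually considered for text categorization tasks (see this paper): R10, the set of the 10 most frequent classes, and R90, the set of the 90 classes with at least one train and one test example.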
Moreover, many of these documents are classified as having no topic at all or as having more than one topic. The following table shows the distribution of the documents per number of topics, where # train docs and # test docs refer to the ModApté split and # other refers to documents that were not considered in this split:

Reuters-21578

| # Topics | # train docs | # test docs | # other | Total # docs |
|---|---|---|---|---|
| 0 | 1828 | 280 | 8103 | 10211 |
| 1 | 6552 | 2581 | 361 | 9494 |
| 2 | 890 | 309 | 135 | 1334 |
| 3 | 191 | 64 | 55 | 310 |
| 4 | 62 | 32 | 10 | 104 |
| 5 | 39 | 14 | 8 | 61 |
| 6 | 21 | 6 | 3 | 30 |
| 7 | 7 | 4 | 0 | 11 |
| 8 | 4 | 2 | 0 | 6 |
| 9 | 4 | 2 | 0 | 6 |
| 10 | 3 | 1 | 0 | 4 |
| 11 | 0 | 1 | 1 | 2 |
| 12 | 1 | 1 | 0 | 2 |
| 13 | 0 | 0 | 0 | 0 |
| 14 | 0 | 2 | 0 | 2 |
| 15 | 0 | 0 | 0 | 0 |
| 16 | 1 | 0 | 0 | 1 |
As the goal of this page is to consider single-labeled datasets, all the documents with no topic or with more than one topic were eliminated. As a result, some of the classes in R10 and R90 were left with no train or test documents.
Considering only the documents with a single topic and the classes which still have at least one train and one test example, we have 8 of the 10 most frequent classes and 52 of the original 90.
Following Sebastiani's convention, we will call these sets R8 and R52. Note that from R10 to R8 the classes corn and wheat, which are intimately related to the class grain, disappeared, and grain itself lost many of its documents.
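The filtering described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `(split, topics)` record layout, the function name `single_label_subset`, and the sample records are all hypothetical, and this is not the script that produced R8 and R52.

```python
from typing import List, Tuple

# Hypothetical record layout: (split name, list of topic labels).
Record = Tuple[str, List[str]]

def single_label_subset(records: List[Record]) -> List[Tuple[str, str]]:
    """Keep single-topic train/test documents whose class appears in both splits."""
    # Step 1: keep only documents in the split that have exactly one topic.
    single = [
        (split, topics[0])
        for split, topics in records
        if split in ("train", "test") and len(topics) == 1
    ]
    # Step 2: keep only classes with at least one train and one test document.
    train_classes = {c for s, c in single if s == "train"}
    test_classes = {c for s, c in single if s == "test"}
    kept = train_classes & test_classes
    return [(s, c) for s, c in single if c in kept]

# Invented records, not taken from the real collection.
records = [
    ("train", ["earn"]), ("test", ["earn"]),
    ("train", ["grain", "wheat"]),  # more than one topic: dropped
    ("test", []),                   # no topic: dropped
    ("train", ["corn"]),            # class has no test document: dropped
    ("other", ["acq"]),             # outside the train/test split: dropped
    ("train", ["acq"]), ("test", ["acq"]),
]
subset = single_label_subset(records)
print(sorted({c for _, c in subset}))
```

The second step explains why the totals for R52 are slightly smaller than the number of single-topic documents in the split: a class that survives only on one side of the split is removed entirely.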
The distribution of documents per class is the following for R8 and R52:

R8

| Class | # train docs | # test docs | Total # docs |
|---|---|---|---|
| acq | 1596 | 696 | 2292 |
| crude | 253 | 121 | 374 |
| earn | 2840 | 1083 | 3923 |
| grain | 41 | 10 | 51 |
| interest | 190 | 81 | 271 |
| money-fx | 206 | 87 | 293 |
| ship | 108 | 36 | 144 |
| trade | 251 | 75 | 326 |
| Total | 5485 | 2189 | 7674 |

R52

| Class | # train docs | # test docs | Total # docs |
|---|---|---|---|
| acq | 1596 | 696 | 2292 |
| alum | 31 | 19 | 50 |
| bop | 22 | 9 | 31 |
| carcass | 6 | 5 | 11 |
| cocoa | 46 | 15 | 61 |
| coffee | 90 | 22 | 112 |
| copper | 31 | 13 | 44 |
| cotton | 15 | 9 | 24 |
| cpi | 54 | 17 | 71 |
| cpu | 3 | 1 | 4 |
| crude | 253 | 121 | 374 |
| dlr | 3 | 3 | 6 |
| earn | 2840 | 1083 | 3923 |
| fuel | 4 | 7 | 11 |
| gas | 10 | 8 | 18 |
| gnp | 58 | 15 | 73 |
| gold | 70 | 20 | 90 |
| grain | 41 | 10 | 51 |
| heat | 6 | 4 | 10 |
| housing | 15 | 2 | 17 |
| income | 7 | 4 | 11 |
| instal-debt | 5 | 1 | 6 |
| interest | 190 | 81 | 271 |
| ipi | 33 | 11 | 44 |
| iron-steel | 26 | 12 | 38 |
| jet | 2 | 1 | 3 |
| jobs | 37 | 12 | 49 |
| lead | 4 | 4 | 8 |
| lei | 11 | 3 | 14 |
| livestock | 13 | 5 | 18 |
| lumber | 7 | 4 | 11 |
| meal-feed | 6 | 1 | 7 |
| money-fx | 206 | 87 | 293 |
| money-supply | 123 | 28 | 151 |
| nat-gas | 24 | 12 | 36 |
| nickel | 3 | 1 | 4 |
| orange | 13 | 9 | 22 |
| pet-chem | 13 | 6 | 19 |
| platinum | 1 | 2 | 3 |
| potato | 2 | 3 | 5 |
| reserves | 37 | 12 | 49 |
| retail | 19 | 1 | 20 |
| rubber | 31 | 9 | 40 |
| ship | 108 | 36 | 144 |
| strategic-metal | 9 | 6 | 15 |
| sugar | 97 | 25 | 122 |
| tea | 2 | 3 | 5 |
| tin | 17 | 10 | 27 |
| trade | 251 | 75 | 326 |
| veg-oil | 19 | 11 | 30 |
| wpi | 14 | 9 | 23 |
| zinc | 8 | 5 | 13 |
| Total | 6532 | 2568 | 9100 |
The documents in the Cade12 dataset are a subset of web pages extracted from the CADÊ Web Directory, which points to Brazilian web pages classified by human experts. This directory is available at Cade's Homepage, in Brazilian Portuguese.
A pre-processed version of this dataset was made available to me by Marco Cristo, from Universidade Federal de Minas Gerais, in Brazil. This dataset is part of the Gerindo project.
Because there is no standard train/test split for this dataset, and in order to be consistent with the previous ones, I randomly chose two thirds of the documents for training and the remaining third for testing.
For this particular split, the distribution of documents per class is the following:

Cade12

| Class | # train docs | # test docs | Total # docs |
|---|---|---|---|
| 01--servicos | 5627 | 2846 | 8473 |
| 02--sociedade | 4935 | 2428 | 7363 |
| 03--lazer | 3698 | 1892 | 5590 |
| 04--informatica | 2983 | 1536 | 4519 |
| 05--saude | 2118 | 1053 | 3171 |
| 06--educacao | 1912 | 944 | 2856 |
| 07--internet | 1585 | 796 | 2381 |
| 08--cultura | 1494 | 643 | 2137 |
| 09--esportes | 1277 | 630 | 1907 |
| 10--noticias | 701 | 381 | 1082 |
| 11--ciencias | 569 | 310 | 879 |
| 12--compras-online | 423 | 202 | 625 |
| Total | 27322 | 13661 | 40983 |
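
A random two-thirds/one-third split like the one described above could be produced along the following lines. This is a sketch only: the function name `random_split`, the seed, and the rounding rule are assumptions, so it will not reproduce the particular split distributed on this page.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")

def random_split(
    items: Sequence[T], train_fraction: float = 2 / 3, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle a copy of the items and cut it into train and test parts."""
    shuffled = list(items)
    # A private, seeded generator keeps the split repeatable.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Placeholder documents, just to show the sizes of the two parts.
docs = [f"doc{i}" for i in range(9)]
train, test = random_split(docs)
print(len(train), len(test))
```

With the 40983 Cade12 documents, the cut falls at 27322, which agrees with the totals in the table above. Because the split is not stratified, the per-class proportions are only approximately two to one.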
The documents in the WebKB dataset are web pages collected by the World Wide Knowledge Base (Web->Kb) project of the CMU text learning group, and were downloaded from The 4 Universities Data Set Homepage. These pages were collected from computer science departments of various universities in 1997 and manually classified into seven different classes: student, faculty, staff, department, course, project, and other.
The class other is a collection of pages that were not deemed the "main page" representing an instance of the previous six classes. For example, a particular faculty member may be represented by a home page, a publications list, a vitae and several research interests pages. Only the faculty member's home page was placed in the faculty class. The publications list, vitae and research interests pages were all placed in the other category.
For each class, the collection contains pages from four universities: Cornell, Texas, Washington, Wisconsin, and other miscellaneous pages collected from other universities.
I discarded the classes department and staff because there were only a few pages from each university. I also discarded the class other because its pages differed too much from one another.
Because there is no standard train/test split for this dataset, and in order to be consistent with the previous ones, I randomly chose two thirds of the documents for training and the remaining third for testing.
For this particular split, the distribution of documents per class is the following:

WebKB

| Class | # train docs | # test docs | Total # docs |
|---|---|---|---|
| project | 336 | 168 | 504 |
| course | 620 | 310 | 930 |
| faculty | 750 | 374 | 1124 |
| student | 1097 | 544 | 1641 |
| Total | 2803 | 1396 | 4199 |

20 Newsgroups

| | Train | Test |
|---|---|---|
| # documents | 11293 docs | 7528 docs |
| all-terms | 20ng-train-all-terms (15.91 Mb) | 20ng-test-all-terms (10.31 Mb) |
| no-short | 20ng-train-no-short (14.06 Mb) | 20ng-test-no-short (9.12 Mb) |
| no-stop | 20ng-train-no-stop (10.59 Mb) | 20ng-test-no-stop (6.86 Mb) |
| stemmed | 20ng-train-stemmed (9.46 Mb) | 20ng-test-stemmed (6.13 Mb) |

Reuters-21578 R8 and R52

| | R8 train | R8 test | R52 train | R52 test |
|---|---|---|---|---|
| # documents | 5485 docs | 2189 docs | 6532 docs | 2568 docs |
| all-terms | r8-train-all-terms (3.20 Mb) | r8-test-all-terms (1.14 Mb) | r52-train-all-terms (4.08 Mb) | r52-test-all-terms (1.45 Mb) |
| no-short | r8-train-no-short (2.90 Mb) | r8-test-no-short (1.03 Mb) | r52-train-no-short (3.71 Mb) | r52-test-no-short (1.32 Mb) |
| no-stop | r8-train-no-stop (2.42 Mb) | r8-test-no-stop (0.86 Mb) | r52-train-no-stop (3.08 Mb) | r52-test-no-stop (1.09 Mb) |
| stemmed | r8-train-stemmed (2.13 Mb) | r8-test-stemmed (0.76 Mb) | r52-train-stemmed (2.71 Mb) | r52-test-stemmed (0.96 Mb) |

Cade12

| | Train | Test |
|---|---|---|
| # documents | 27322 docs | 13661 docs |
| stemmed | cade-train-stemmed (24.50 Mb) | cade-test-stemmed (11.65 Mb) |

WebKB

| | Train | Test |
|---|---|---|
| # documents | 2803 docs | 1396 docs |
| stemmed | webkb-train-stemmed (2.40 Mb) | webkb-test-stemmed (1.20 Mb) |

All the files mentioned above are also available here in a single zip file (48 Mb).
All of these are text files containing one document per line.
Each document is composed of its class and its terms.
Each document is represented by a "word" representing the document's class, a TAB character and then a sequence of "words" delimited by spaces, representing the terms contained in the document.
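
As an illustration of this format, the following sketch parses such lines into a class label and a list of terms. The helper names `parse_line` and `read_documents` and the two sample lines are made up for the example; they are not part of the distributed files.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Document = Tuple[str, List[str]]

def parse_line(line: str) -> Document:
    """Split one line into (class label, list of terms)."""
    # The class comes first, then a TAB, then space-delimited terms.
    label, _, text = line.rstrip("\r\n").partition("\t")
    return label, text.split()

def read_documents(lines: Iterable[str]) -> List[Document]:
    """Parse every non-empty line; each line holds one document."""
    return [parse_line(line) for line in lines if line.strip()]

# Two invented lines in the described format (not from the real files).
sample = [
    "earn\tcompany reports quarterly profit\n",
    "acq\tfirm agrees to buy rival\n",
]
docs = read_documents(sample)
class_counts = Counter(label for label, _ in docs)
print(docs[0])
print(class_counts)
```

Since `read_documents` accepts any iterable of lines, an open file handle for one of the downloaded files can be passed to it directly. Counting the labels, as in `class_counts`, is one way to check a file against the per-class tables on this page.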
Except for the Cade12 dataset, I obtained the present files by applying the following pre-processing to the original datasets:
Just to give an idea of the relative hardness of each dataset, I have determined the accuracy that some of the most common classification methods achieve on them. As usual, document vectors were built using tf-idf term weighting and normalized to unit length. The stemmed train and test sets were used for each dataset.
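
For readers unfamiliar with this representation, here is a minimal sketch of tf-idf weighting followed by normalization to unit length. There are many tf-idf variants and the page does not say which one was used, so the formula below (raw term frequency times log(N/df)) is an assumption, as are the function names and the toy documents.

```python
import math
from collections import Counter
from typing import Dict, List

def tfidf_unit_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Weight each term by tf * log(N / df), then scale to unit length."""
    n_docs = len(docs)
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        weights = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:  # an all-zero vector is left unchanged
            weights = {t: w / norm for t, w in weights.items()}
        vectors.append(weights)
    return vectors

def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Dot product, which equals cosine similarity for unit-length vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

# Toy documents, invented for the example.
toy_docs = [
    ["oil", "price", "rise"],
    ["oil", "export", "fall"],
    ["bank", "rate", "rise"],
]
vectors = tfidf_unit_vectors(toy_docs)
print(cosine(vectors[0], vectors[1]) > cosine(vectors[1], vectors[2]))
```

Normalizing to unit length means that the similarity between two documents reduces to a plain dot product, which is convenient for methods such as kNN and the centroid classifier.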
The "dumb classifier" is included as a baseline. It ignores the query document and always predicts the most frequent class in the training set.
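
The baseline is simple enough to check by hand. The sketch below rebuilds label lists from the per-class R8 counts given earlier on this page and computes the resulting accuracy; the function names and the `expand` helper are only illustrative.

```python
from collections import Counter
from typing import Dict, List, Sequence

def majority_class(train_labels: Sequence[str]) -> str:
    """Return the most frequent class label in the training set."""
    return Counter(train_labels).most_common(1)[0][0]

def accuracy(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Fraction of predictions that match the true labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

def expand(counts: Dict[str, int]) -> List[str]:
    """Turn per-class counts into a flat list of labels."""
    return [label for label, n in counts.items() for _ in range(n)]

# Per-class document counts copied from the R8 table on this page.
r8_train = {"acq": 1596, "crude": 253, "earn": 2840, "grain": 41,
            "interest": 190, "money-fx": 206, "ship": 108, "trade": 251}
r8_test = {"acq": 696, "crude": 121, "earn": 1083, "grain": 10,
           "interest": 81, "money-fx": 87, "ship": 36, "trade": 75}

train_labels = expand(r8_train)
test_labels = expand(r8_test)
guess = majority_class(train_labels)
baseline = accuracy([guess] * len(test_labels), test_labels)
print(guess, round(baseline, 4))  # earn 0.4947
```

The result is 1083/2189, that is, the share of test documents belonging to earn, the largest training class.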

Accuracy values

| Classification Method | R8 | R52 | 20Ng | Cade12 | WebKB |
|---|---|---|---|---|---|
| Dumb classifier | 0.4947 | 0.4217 | 0.0530 | 0.2083 | 0.3897 |
| Vector Method | 0.7889 | 0.7687 | 0.7240 | 0.4142 | 0.6447 |
| kNN (k = 10) | 0.8524 | 0.8322 | 0.7593 | 0.5120 | 0.7256 |
| Centroid (Normalized Sum) | 0.9356 | 0.8717 | 0.7885 | 0.5148 | 0.8266 |
| Naive Bayes | 0.9607 | 0.8692 | 0.8103 | 0.5727 | 0.8352 |
| SVM (Linear Kernel) | 0.9698 | 0.9377 | 0.8284 | 0.5284 | 0.8582 |
Note that, because R8, R52, and WebKB are very skewed, the dumb classifier has a "reasonable" performance on these datasets. It is also worth noting that, while for R8, R52, 20Ng, and WebKB it is possible to find good classifiers, that is, classifiers that achieve a high accuracy, for Cade12 the best result does not reach 58% accuracy, even with some of the best classifiers available.
Last updated April 2007.