Datasets

# Introduction

This webpage describes the data I used in my research. The data is divided into the following categories:

1. Text
2. Network
3. Documents
4. Academic Papers

The data in each category is cleaned into the same format for easy access. The description of each category can be found in a separate section.

---

# Datasets

<table style="text-align:right"> <thead> <tr> <th colspan=8 style="text-align:center">[Text](#h3_text)</th> </tr> <tr> <th style="text-align:right">Datasets</th> <th style="text-align:right">Vocabulary</th> <th style="text-align:right">Corpus</th> <th style="text-align:right" colspan=4>Description</th> <th style="text-align:right">Original source</th> </tr> </thead> <tbody> <tr> <td style="vertical-align: middle;">[Text8](#h3.1_text8-a-href-.-text8.txt.gz-download-a-)</td> <td style="vertical-align: middle;">253,854</td> <td style="vertical-align: middle;">17 million</td> <td style="vertical-align: middle;" colspan=4>First $10^8$ bytes of [enwik9](http://mattmahoney.net/dc/textdata.html). Widely used for demonstrating the performance of word embeddings</td> <td style="vertical-align: middle;">[mattmahoney](http://mattmahoney.net/dc/text8.zip)</td> </tr> <tr> <td style="vertical-align: middle;">[News2010](#h3.2_news2010-a-href-.-news2010.txt.gz-download-a-)</td> <td style="vertical-align: middle;">1,952,790</td> <td style="vertical-align: middle;">136 million</td> <td style="vertical-align: middle;" colspan=4>Text extracted from online news.</td> <td style="vertical-align: middle;">[wmt14](http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2010.en.shuffled.gz)</td> </tr> <tr> <td style="vertical-align: middle;">[Wikipedia dump](#h3.3_wikipedia-dump-a-href-.-wiki20180201.txt.gz-download-a-)</td> <td style="vertical-align: middle;">9,111,933</td> <td style="vertical-align: middle;">2,435 million</td> <td style="vertical-align: middle;" colspan=4>English articles from 20180201 Wikipedia dump.
Text is extracted with [gensim WikiCorpus utils](https://radimrehurek.com/gensim/corpora/wikicorpus.html).</td> <td style="vertical-align: middle;">[wikimedia](https://dumps.wikimedia.org/)</td> </tr> </tbody> <thead> <tr> <th colspan=8 style="text-align:center">[Network](#h4_network)</th> </tr> <tr> <th style="text-align:right">Datasets</th> <th style="text-align:right">Node</th> <th style="text-align:right">Edge</th> <th style="text-align:right">Labels</th> <th style="text-align:right">Context/Attributes</th> <th style="text-align:right" colspan=2>Description</th> <th style="text-align:right">Original source</th> </tr> </thead> <tbody> <tr> <td style="vertical-align: middle;">[WebKB](#h4.1_webkb-a-href-.-webkb.tar.gz-download-a-)</td> <td style="vertical-align: middle;">877</td> <td style="vertical-align: middle;">1,608</td> <td style="vertical-align: middle;">5</td> <td style="vertical-align: middle;">Context as 0/1 Vectors</td> <td style="vertical-align: middle;" colspan=2>Web pages from four computer science departments</td> <td style="vertical-align: middle;">[LINQS](https://linqs.soe.ucsc.edu/data)</td> </tr> <tr> <td style="vertical-align: middle;">[Cora](#h4.2_cora-a-href-.-cora.tar.gz-download-a-)</td> <td style="vertical-align: middle;">2,708</td> <td style="vertical-align: middle;">5,429</td> <td style="vertical-align: middle;">7</td> <td style="vertical-align: middle;">Context as 0/1 Vectors</td> <td style="vertical-align: middle;" colspan=2>Scientific publications</td> <td style="vertical-align: middle;">[LINQS](https://linqs.soe.ucsc.edu/data)</td> </tr> <tr> <td style="vertical-align: middle;">[CiteSeer](#h4.3_citeseer-a-href-.-citeseer.tar.gz-download-a-)</td> <td style="vertical-align: middle;">3,319</td> <td style="vertical-align: middle;">4,722</td> <td style="vertical-align: middle;">6</td> <td style="vertical-align: middle;">Context as 0/1 Vectors</td> <td style="vertical-align: middle;" colspan=2>Scientific publications</td> <td
style="vertical-align: middle;">[LINQS](https://linqs.soe.ucsc.edu/data)</td> </tr> <tr> <td style="vertical-align: middle;">[BlogCatalog](#h4.4_blogcatalog-a-href-.-blogcatalog.tar.gz-download-a-)</td> <td style="vertical-align: middle;">10,312</td> <td style="vertical-align: middle;">333,983</td> <td style="vertical-align: middle;">39(Multi-label)</td> <td style="vertical-align: middle;">--</td> <td style="vertical-align: middle;" colspan=2>Social blog directory website. Widely used for demonstrating the performance of network embeddings</td> <td style="vertical-align: middle;">[link](http://socialcomputing.asu.edu/datasets/BlogCatalog3)</td> </tr> <tr> <td style="vertical-align: middle;">[PubMed](#h4.5_pubmed-a-href-.-pubmed.tar.gz-download-a-)</td> <td style="vertical-align: middle;">19,717</td> <td style="vertical-align: middle;">44,338</td> <td style="vertical-align: middle;">3</td> <td style="vertical-align: middle;">Context as TF-IDF Vectors</td> <td style="vertical-align: middle;" colspan=2>Scientific publications</td> <td style="vertical-align: middle;">[LINQS](https://linqs.soe.ucsc.edu/data)</td> </tr> <tr> <td style="vertical-align: middle;">[Flickr](#h4.6_flickr-a-href-.-flickr.tar.gz-download-a-)</td> <td style="vertical-align: middle;">80,513</td> <td style="vertical-align: middle;">5,899,882</td> <td style="vertical-align: middle;">195(Multi-label)</td> <td style="vertical-align: middle;">--</td> <td style="vertical-align: middle;" colspan=2></td> <td style="vertical-align: middle;">[link](http://socialcomputing.asu.edu/datasets/Flickr)</td> </tr> <tr> <td style="vertical-align: middle;">[YouTube](#h4.7_youtube-a-href-.-youtube.tar.gz-download-a-)</td> <td style="vertical-align: middle;">1,138,499</td> <td style="vertical-align: middle;">2,990,443</td> <td style="vertical-align: middle;">47(Multi-label)</td> <td style="vertical-align: middle;">--</td> <td style="vertical-align: middle;" colspan=2></td> <td style="vertical-align: 
middle;">[link](http://socialcomputing.asu.edu/datasets/YouTube2)</td> </tr> </tbody> <thead> <tr> <th colspan=8 style="text-align:center">[Documents](#h5_documents)</th> </tr> <tr> <th style="text-align:right">Datasets</th> <th style="text-align:right">Document</th> <th style="text-align:right">Vocabulary</th> <th style="text-align:right">Corpus</th> <th style="text-align:right">Labels</th> <th style="text-align:right" colspan=2>Description</th> <th style="text-align:right">Original source</th> </tr> </thead> <tbody> <tr> <td style="vertical-align: middle;">[IMDB](#h5.1_imdb-a-href-.-imdb.tar.gz-download-a-)</td> <td style="vertical-align: middle;">100,000</td> <td style="vertical-align: middle;">256,510</td> <td style="vertical-align: middle;">26,293,872</td> <td style="vertical-align: middle;">25,000 positive,<br/>25,000 negative,<br/>50,000 unlabeled</td> <td style="vertical-align: middle;" colspan=2>Movie Review Dataset from IMDB</td> <td style="vertical-align: middle;">[Stanford](http://ai.stanford.edu/~amaas/data/sentiment/)</td> </tr> <tr> <td style="vertical-align: middle;">[arXiv](#h5.2_arxiv-a-href-.-arxiv.tar.gz-download-a-)</td> <td style="vertical-align: middle;">840,218</td> <td style="vertical-align: middle;">1,144,437</td> <td style="vertical-align: middle;">109,994,801</td> <td style="vertical-align: middle;">127,872 cs,<br/>88,896 physics,<br/>297,094 math,<br/> 326,356 others</td> <td style="vertical-align: middle;" colspan=2>Academic papers from arXiv</td> <td style="vertical-align: middle;">[original data](./org-text.txt)</td> </tr> </tbody> <thead> <tr> <th colspan=8 style="text-align:center">[Academic Papers](#h6_academic-papers)</th> </tr> <tr> <th style="text-align:right">Datasets</th> <th style="text-align:right">Papers</th> <th style="text-align:right">Vocabulary</th> <th style="text-align:right">Corpus</th> <th style="text-align:right">Citations</th> <th style="text-align:right">Labels</th> <th style="text-align:right">Description</th>
<th style="text-align:right">Original source</th> </tr> </thead> <tboday> <tr> <td style="vertical-align: middle;">[Cora Enrich](#h6.1_cora-enrich-a-href-.-cora_enrich.tar.gz-download-a-)</td> <td style="vertical-align: middle;">2,708</td> <td style="vertical-align: middle;">25,955</td> <td style="vertical-align: middle;">2,522,761</td> <td style="vertical-align: middle;">5,429</td> <td style="vertical-align: middle;">7</td> <td style="vertical-align: middle;">Papers in Machine Learning</td> <td style="vertical-align: middle;">[original data](./cora_enrich.tar.gz)</td> </tr> <tr> <td style="vertical-align: middle;">[AMinerV8](#h6.2_aminerv8-a-href-.-aminerv8.tar.gz-download-a-)</td> <td style="vertical-align: middle;">777,262</td> <td style="vertical-align: middle;">69,606</td> <td style="vertical-align: middle;">5,435,631</td> <td style="vertical-align: middle;">4,191,523</td> <td style="vertical-align: middle;">10</td> <td style="vertical-align: middle;">Papers from DBLP</td> <td style="vertical-align: middle;">[AMiner]()</td> </tr> <tr> <td style="vertical-align: middle;">[AMinerV10](#h6.3_aminerv10-a-href-.-aminerv10.tar.gz-download-a-)</td> <td style="vertical-align: middle;">2,725,523</td> <td style="vertical-align: middle;">588,303</td> <td style="vertical-align: middle;">231,398,743</td> <td style="vertical-align: middle;">25,166,994</td> <td style="vertical-align: middle;">--</td> <td style="vertical-align: middle;">Papers from DBLP</td> <td style="vertical-align: middle;">[AMiner]()</td> </tr> <tr> <td style="vertical-align: middle;">[MAG](#h6.4_mag-a-href-.-mag.tar.gz-download-a-)</td> <td style="vertical-align: middle;">46,642,396</td> <td style="vertical-align: middle;">2,475,973</td> <td style="vertical-align: middle;">402,303,849</td> <td style="vertical-align: middle;">528,682,289</td> <td style="vertical-align: middle;">--</td> <td style="vertical-align: middle;"></td> <td style="vertical-align:
middle;">[Microsoft](https://www.microsoft.com/en-us/research/project/microsoft-academic-graph/)</td> </tr> <tr> <td style="vertical-align: middle;">[Meta](#h6.5_meta-a-href-.-meta.tar.gz-download-a-)</td> <td style="vertical-align: middle;">15,660,195</td> <td style="vertical-align: middle;">1,776,468</td> <td style="vertical-align: middle;">1,616,186,164</td> <td style="vertical-align: middle;">213,036,526</td> <td style="vertical-align: middle;">--</td> <td style="vertical-align: middle;">Papers in the health domain</td> <td style="vertical-align: middle;">[Meta](http://meta.com)</td> </tr> </tboday> </table>

---

# Text

These datasets are used to train word embedding algorithms such as word2vec. Note that the analogy task contains test cases such as `work,working => walk,walking`, which rely on the different forms of a word staying distinct, so stemming is not applied. The datasets here are therefore provided as raw, unstemmed text. Each dataset consists of tokens separated by `' '` or `'\n'`.

[//]: # (The test cases are used to evaluate the performance of word embeddings. )

Here is an example of the dataset:

```dos
anarchism originated as a term of abuse first used against early working class radicals including the diggers of the english revolution and the sans culottes of the french revolution whilst the term is still used ...
```

## Text8 [\[Download\]](./text8.txt.gz)

Text8 contains the first $10^8$ bytes of [enwik9](http://mattmahoney.net/dc/textdata.html). It is widely used for demonstrating the performance of word embeddings. We recommend using this dataset to test your algorithms.

## News2010 [\[Download\]](./news2010.txt.gz)

News2010 is a dataset from the translation task of the WMT2014 workshop. The article text was crawled and extracted from various online news publications as described [here](http://www.statmt.org/wmt13/translation-task.html).
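Because every text dataset on this page is a gzip-compressed file of tokens separated by spaces or newlines, a corpus such as News2010 can be streamed token by token. The following is a minimal Python sketch, not the code used to build the datasets. The helper names `iter_tokens` and `corpus_stats` are hypothetical, and UTF-8 encoding is an assumption. Note that the file is read line by line, so a corpus stored as one very long line is loaded in one piece.

```python
import gzip
from collections import Counter
from typing import Iterator, Tuple


def iter_tokens(path: str) -> Iterator[str]:
    """Yield tokens from a gzip-compressed, whitespace-separated text file.

    Assumption: the file is UTF-8 encoded.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            # str.split() without arguments splits on spaces and newlines alike.
            yield from line.split()


def corpus_stats(path: str) -> Tuple[int, int]:
    """Return (corpus size in tokens, vocabulary size) for one dataset file."""
    counts = Counter(iter_tokens(path))
    return sum(counts.values()), len(counts)
```

Calling `corpus_stats("news2010.txt.gz")` on a downloaded file should give numbers comparable to the Corpus and Vocabulary columns of the table, provided the same tokenization was used.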
## Wikipedia dump [\[Download\]](./wiki20180201.txt.gz)

We take all English articles from the 20180201 [Wikipedia dump](https://dumps.wikimedia.org/enwiki/). Text is extracted with [gensim WikiCorpus utils](https://radimrehurek.com/gensim/corpora/wikicorpus.html). Links are removed, and all words are stored in lowercase. Words with fewer than 2 or more than 15 characters are removed. Articles that contain fewer than 50 words are removed. In total, 4,408,521 valid texts are extracted. The corpus size is 2,435,723,432 and the vocabulary size is 9,111,933. Removing words that appear fewer than 100 times results in a vocabulary of 319,591 words.

---

# Network

There are seven network datasets. Each dataset contains nodes and the links between them. Each node is associated with one or more labels. All nodes are connected in the network. In some datasets (WebKB, Cora, CiteSeer, and PubMed), nodes have text attributes, which are represented as a 0/1 vector or a TF-IDF vector.

The network is represented as an edge list stored in `edges.csv`. The first element is the `source node` and the second element is the `target node`. Elements are separated by `,`. Here is an example:

```[csv]
100197,193931
100197,447250
100197,688361
...
```

The label information is provided in `group-edges.csv`. The first element is the `node id` and the second element is the `group id`. Elements are separated by `,`. For example:

```[csv]
1000012,Rule_Learning
100197,Neural_Networks
100701,Case_Based
100935,Genetic_Algorithms
100961,Neural_Networks
...
```

In some datasets (BlogCatalog, Flickr, and YouTube), nodes can have multiple labels.

## WebKB [\[Download\]](./WebKB.tar.gz)

The WebKB dataset contains 877 web pages from four computer science departments. There are 1,608 links between the web pages. The web pages are classified into five categories:

* Course
* Faculty
* Student
* Project
* Staff

The dataset contains 1,703 unique words.
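The `edges.csv` and `group-edges.csv` files of a network dataset such as WebKB can be loaded with the Python standard library alone. The sketch below is one possible reader, not part of the datasets. It assumes the archive has been extracted and that both files are comma-separated without a header row, as in the examples in the Network section. `load_edges` and `load_labels` are hypothetical helper names, and node ids are kept as strings.

```python
import csv
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def load_edges(path: str) -> List[Tuple[str, str]]:
    """Read `edges.csv` as a list of (source node, target node) pairs."""
    with open(path, newline="", encoding="utf-8") as handle:
        # Rows with fewer than two fields (e.g. blank lines) are skipped.
        return [(row[0], row[1]) for row in csv.reader(handle) if len(row) >= 2]


def load_labels(path: str) -> Dict[str, Set[str]]:
    """Read `group-edges.csv` into a mapping node id -> set of group ids.

    A set is used because nodes in some datasets carry multiple labels.
    """
    labels: Dict[str, Set[str]] = defaultdict(set)
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle):
            if len(row) >= 2:
                labels[row[0]].add(row[1])
    return dict(labels)
```

Storing the labels as sets lets the same reader serve both the single-label datasets and the multi-label ones (BlogCatalog, Flickr, and YouTube).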
## Cora [\[Download\]](./Cora.tar.gz)

The Cora dataset contains 2,708 scientific publications in machine learning, connected by 5,429 citation links. Each paper is manually labeled as one of seven categories:

* Case Based
* Genetic Algorithms
* Neural Networks
* Probabilistic Methods
* Reinforcement Learning
* Rule Learning
* Theory

Text information is provided as a binary vector indicating the absence or presence of each vocabulary word, regardless of order and frequency. Each paper contains a title and an abstract, and words that appear fewer than 10 times are removed from the vocabulary. The vocabulary size is 1,433. All papers have text and at least one neighbor in the citation graph. The average degree of the citation network is 2.00.

## Citeseer [\[Download\]](./CiteSeer.tar.gz)

The CiteSeer dataset contains 3,319 publications with 4,722 reference links. Papers are categorized into 6 classes:

* Agents
* AI
* DB
* IR
* ML
* HCI

There are 3,703 unique words.

## BlogCatalog [\[Download\]](./BlogCatalog.tar.gz)

BlogCatalog is a social blog directory website. It is an undirected, unweighted graph with 333,983 edges and 10,312 nodes labeled into 39 groups. Each node has been assigned at least one label.

[//]: # (The data is crawled by \cite{tang_relational_2009,tang_scalable_2009}. )

## PubMed [\[Download\]](./PubMed.tar.gz)

The PubMed dataset consists of 19,717 scientific publications from the PubMed database pertaining to diabetes, each classified into one of three classes:

* Diabetes Mellitus, Experimental
* Diabetes Mellitus Type 1
* Diabetes Mellitus Type 2

The citation network consists of 44,338 links. Each publication in the dataset is described by a TF-IDF weighted word vector from a dictionary of 500 unique words.

## Flickr [\[Download\]](./Flickr.tar.gz)

Flickr is an image and video hosting website, web services suite, and online community. The data contains the friendship network and group memberships. It has 80,513 nodes and 5,899,882 edges.
The nodes have been labeled into 195 groups. Nodes can have multiple labels.

[//]: # (This data is crawled by \cite{tang_relational_2009,tang_scalable_2009})

## Youtube [\[Download\]](./YouTube.tar.gz)

YouTube is a video-sharing website. It has 1,138,499 nodes and 2,990,443 edges with 47 labels.

[//]: # (this data is crawled by \cite{tang_relational_2009,tang_scalable_2009}.)

---

# Documents

These datasets are used to train document embeddings, which are evaluated through a classification task.

**Pre-processing**

The text is preprocessed with [nltk](https://www.nltk.org/) as follows:

1. Tokenize with `nltk.tokenize.RegexpTokenizer(r'\b[a-zA-Z]+\b')`
2. Remove the stopwords provided by `nltk.corpus.stopwords.words('english')`
3. Stem words with `nltk.stem.PorterStemmer()`

## IMDB [\[Download\]](./IMDB.tar.gz)

This is a dataset for binary sentiment classification. The data contains 100,000 movie reviews divided into three categories:

* 25,000 positive reviews
* 25,000 negative reviews
* 50,000 unlabeled reviews

Raw text is provided.

## arXiv [\[Download\]](./arXiv.tar.gz)

The arXiv dataset contains the titles and abstracts of 840,218 papers from arXiv. The papers are classified into four groups:

* 127,872 Computer Science
* 88,896 Physics
* 297,094 Math
* 326,356 Others

---

# Academic Papers

Academic papers not only contain plain text but are also linked to each other through citations and references.

**Pre-processing**

The text is preprocessed with [nltk](https://www.nltk.org/) as follows:

1. Tokenize with `nltk.tokenize.RegexpTokenizer(r'\b[a-zA-Z]+\b')`
2. Remove the stopwords provided by `nltk.corpus.stopwords.words('english')`
3. Stem words with `nltk.stem.PorterStemmer()`

In addition, we filter the papers to make sure every paper has

1. Text.
2. At least one citation or reference in the network.

The following datasets have been cleaned into a fixed format for fast access. Each dataset contains at least three files: `idxs.txt`, `links.txt`, and `texts.txt`.
`labels.txt` is present if labels are available. Each line of a file holds one attribute of a paper, and the same line number refers to the same paper in every file. For example, line 100 of these files reads:

```[text]
==> idxs.txt <==
...
99023
...
==> texts.txt <==
...
fast fix point algorithm independ compon analysi paper appear neural comput abstract introduc novel fast algorithm independ compon analysi use blind sourc separ featur extract shown neural network learn rule transform txed point iter provid algorithm simpl depend user detn paramet fast converg accur solut allow data algorithm tnd one time non gaussian independ compon ...
...
==> links.txt <==
...
578306 578309 578347
...
==> labels.txt <==
...
Neural_Networks
...
```

This means the paper id is `99023` and its text is `fast fix ...`. It has the references `578306 578309 578347`, and its label is `Neural_Networks`. Note that the citations are separated by spaces.

## Cora Enrich [\[Download\]](./Cora_enrich.tar.gz)

The Cora dataset contains 2,708 scientific publications in machine learning, connected by 5,429 citation links. Each paper is manually labeled as one of seven categories:

* Case Based
* Genetic Algorithms
* Neural Networks
* Probabilistic Methods
* Reinforcement Learning
* Rule Learning
* Theory

Ganguly et al. enriched the text information of [Cora](#h4.2_cora-a-href-.-cora.tar.gz-download-a-). They collected the titles, abstracts, and all citation-containing sentences of each paper, which leads to a vocabulary of 25,955 words and a corpus of 2,522,761 tokens. Each paper contains 937.83 words on average. This dataset shares the same papers, categories, and citation network with the [Cora](#h4.2_cora-a-href-.-cora.tar.gz-download-a-) dataset.

## AMinerV8 [\[Download\]](./AMinerV8.tar.gz)

AMiner contains 777,262 papers and 2,146,341 citations. The average degree is 2.7614.
AMiner labeled 61,256 papers into one of ten categories:

* Artificial intelligence
* Database:Data mining : Information retrieval
* Information security
* Theoretical computer science
* Computer graphics:Multimedia
* High-Performance Computing
* Interdisciplinary Studies
* Computer networks
* Human computer interaction : Ubiquitous computing
* Software engineering

After pre-processing, each paper contains 7.23 words on average.

## AMinerV10 [\[Download\]](./AMinerV10.tar.gz)

AMiner released citation network V10 on October 27, 2017. The original data contains 3,079,007 papers and 25,166,994 citation relationships. Each paper contains an id, title, authors, venue, year, references, and abstract. Labels are not provided by the publisher.

In our work, we preprocessed the data to ensure each paper contains both text and at least one citation in the network. After preprocessing, there are 2,725,523 papers and 25,166,994 citation links in the dataset.

We treat the venue as the label for papers. More specifically, we select all papers published on arXiv, where the venue is the category chosen manually by the author. We then keep the categories that have more than 1,000 papers. This selects 27,972 papers in 14 categories:

* Information Theory
* Computer Vision and Pattern Recognition
* Learning
* Networking and Internet Architecture
* Computation and Language
* Artificial Intelligence
* Data Structures and Algorithms
* Cryptography and Security
* Distributed, Parallel, and Cluster Computing
* Software Engineering
* Machine Learning
* Logic in Computer Science
* Social and Information Networks
* Systems and Control

## MAG [\[Download\]](./MAG.tar.gz)

Microsoft released the [Microsoft Academic Graph](https://www.microsoft.com/en-us/research/project/microsoft-academic-graph/) in 2015. After preprocessing, there are 46.64 million papers and 528.68 million links. We use the title as the text, so each paper has 8.63 words on average.
The data does not provide ground-truth labels. Some papers list the journal or conference in which they were published, so in our work we treat this venue as the label and perform classification over the following venues:

* AAAI
* ICASSP

## Meta [\[Download\]](./Meta.tar.gz)

[Meta](https://meta.com) collected papers from the health domain. Most papers come directly from the publishers and public domains such as [PubMed](https://www.ncbi.nlm.nih.gov/pubmed/). This data contains 15.66 million papers and 213.04 million citations after preprocessing. The title and abstract are used as the text. Each paper has 103.20 words and 13.60 links on average.

Unlike the other datasets, this dataset does not have labels for papers. Instead, papers are assigned to concepts, which are decided exclusively by the academic community in the form of academic ontologies such as MeSH, OMIM, and the Gene Ontology. A paper can have multiple concepts.
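The line-aligned files described at the start of the Academic Papers section (`idxs.txt`, `texts.txt`, `links.txt`, and the optional `labels.txt`) can be read together by iterating over them in parallel. The following Python sketch is only an illustration under stated assumptions: the archive has been extracted into a directory, the files are UTF-8 encoded, and `iter_papers` is a hypothetical helper name rather than part of the datasets.

```python
import os
from contextlib import ExitStack
from typing import Dict, Iterator


def iter_papers(dataset_dir: str) -> Iterator[Dict[str, object]]:
    """Yield one record per paper from the line-aligned dataset files."""
    names = ["idxs.txt", "texts.txt", "links.txt"]
    # `labels.txt` is optional, so it is only read when it exists.
    if os.path.exists(os.path.join(dataset_dir, "labels.txt")):
        names.append("labels.txt")
    with ExitStack() as stack:
        files = [
            stack.enter_context(
                open(os.path.join(dataset_dir, name), encoding="utf-8")
            )
            for name in names
        ]
        # Line i of every file describes the same paper, so zip keeps them aligned.
        for raw_fields in zip(*files):
            fields = [field.rstrip("\n") for field in raw_fields]
            record: Dict[str, object] = {
                "id": fields[0],
                "tokens": fields[1].split(),
                "links": fields[2].split(),  # space-separated paper ids
            }
            if len(fields) == 4:
                record["label"] = fields[3]
            yield record
```

Because the function is a generator, larger datasets such as MAG or Meta can be processed one paper at a time instead of being loaded into memory at once.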