[Text](#h3_text) | |||||||
---|---|---|---|---|---|---|---|
Datasets | Vocabulary | Corpus | Description | Original source | |||
[Text8](#h3.1_text8-a-href-.-text8.txt.gz-download-a-) | 253,854 | 17 million | First $10^8$ bytes of [enwik9](http://mattmahoney.net/dc/textdata.html). Widely used for demonstrating the performance of word embeddings | [mattmahoney](http://mattmahoney.net/dc/text8.zip) | |||
[News2010](#h3.2_news2010-a-href-.-news2010.txt.gz-download-a-) | 1,952,790 | 136 million | Text extracted from online news. | [wmt14](http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2010.en.shuffled.gz) | |||
[Wikipedia dump](#h3.3_wikipedia-dump-a-href-.-wiki20180201.txt.gz-download-a-) | 9,111,933 | 2,435 million | English articles from the 20180201 Wikipedia dump. Text is extracted with the [gensim WikiCorpus utils](https://radimrehurek.com/gensim/corpora/wikicorpus.html). | [wikimedia](https://dumps.wikimedia.org/) | |||
[Network](#h4_network) | |||||||
Datasets | Node | Edge | Labels | Context/Attributes | Description | Original source | |
[WebKB](#h4.1_webkb-a-href-.-webkb.tar.gz-download-a-) | 877 | 1,608 | 5 | Context as 0/1 Vectors | Web pages from four computer science departments | [LINQS](https://linqs.soe.ucsc.edu/data) | |
[Cora](#h4.2_cora-a-href-.-cora.tar.gz-download-a-) | 2,708 | 5,429 | 7 | Context as 0/1 Vectors | Scientific publications | [LINQS](https://linqs.soe.ucsc.edu/data) | |
[CiteSeer](#h4.3_citeseer-a-href-.-citeseer.tar.gz-download-a-) | 3,319 | 4,722 | 6 | Context as 0/1 Vectors | Scientific publications | [LINQS](https://linqs.soe.ucsc.edu/data) | |
[BlogCatalog](#h4.4_blogcatalog-a-href-.-blogcatalog.tar.gz-download-a-) | 10,312 | 333,983 | 39(Multi-label) | -- | Social blog directory website. Widely used for demonstrating the performance of network embeddings | [link](http://socialcomputing.asu.edu/datasets/BlogCatalog3) | |
[PubMed](#h4.5_pubmed-a-href-.-pubmed.tar.gz-download-a-) | 19,717 | 44,338 | 3 | Context as TF-IDF Vectors | Scientific publications | [LINQS](https://linqs.soe.ucsc.edu/data) | |
[Flickr](#h4.6_flickr-a-href-.-flickr.tar.gz-download-a-) | 80,513 | 5,899,882 | 195(Multi-label) | -- | | [link](http://socialcomputing.asu.edu/datasets/Flickr) | |
[YouTube](#h4.7_youtube-a-href-.-youtube.tar.gz-download-a-) | 1,138,499 | 2,990,443 | 47(Multi-label) | -- | | [link](http://socialcomputing.asu.edu/datasets/YouTube2) | |
[Documents](#h5_documents) | |||||||
Datasets | Document | Vocabulary | Corpus | Labels | Description | Original source | |
[IMDB](#h5.1_imdb-a-href-.-imdb.tar.gz-download-a-) | 100,000 | 256,510 | 26,293,872 | 25,000 positive, 25,000 negative, 50,000 unlabeled | Movie review dataset from IMDB | [Stanford](http://ai.stanford.edu/~amaas/data/sentiment/) | |
[arXiv](#h5.2_arxiv-a-href-.-arxiv.tar.gz-download-a-) | 840,218 | 1,144,437 | 109,994,801 | 127,872 cs, 88,896 physics, 297,094 math, 326,356 others | Academic papers from arXiv | [original data](./org-text.txt) | |
[Academic Papers](#h6_academic-papers) | |||||||
Datasets | Papers | Vocabulary | Corpus | Citations | Labels | Description | Original source |
[Cora Enrich](#h6.1_cora-enrich-a-href-.-cora_enrich.tar.gz-download-a-) | 2,708 | 25,955 | 2,522,761 | 5,429 | 7 | Papers in Machine Learning | [original data](./cora_enrich.tar.gz) |
[AMinerV8](#h6.2_aminerv8-a-href-.-aminerv8.tar.gz-download-a-) | 777,262 | 69,606 | 5,435,631 | 4,191,523 | 10 | Papers from DBLP | [AMiner]() |
[AMinerV10](#h6.3_aminerv10-a-href-.-aminerv10.tar.gz-download-a-) | 2,725,523 | 588,303 | 231,398,743 | 25,166,994 | -- | Papers from DBLP | [AMiner]() |
[MAG](#h6.4_mag-a-href-.-mag.tar.gz-download-a-) | 46,642,396 | 2,475,973 | 402,303,849 | 528,682,289 | -- | | [Microsoft](https://www.microsoft.com/en-us/research/project/microsoft-academic-graph/) |
[Meta](#h6.5_meta-a-href-.-meta.tar.gz-download-a-) | 15,660,195 | 1,776,468 | 1,616,186,164 | 213,036,526 | -- | Papers in the health domain | [Meta](http://meta.com) |
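
The Vocabulary and Corpus columns above are not defined in the table itself. A minimal sketch, assuming Vocabulary means the number of distinct whitespace-separated tokens and Corpus means the total number of tokens, shows how such counts could be reproduced for a plain-text dataset. The function name `corpus_stats` and the sample sentences are hypothetical and only serve as illustration; the tokenization actually used for these datasets may differ.

```python
from collections import Counter
from typing import Iterable, Tuple


def corpus_stats(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (vocabulary size, corpus size) for whitespace-tokenized text.

    Assumption: tokens are split on whitespace with no further cleaning.
    """
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    vocabulary = len(counts)        # number of distinct tokens
    corpus = sum(counts.values())   # total number of tokens
    return vocabulary, corpus


# Hypothetical two-line sample standing in for a real dataset file.
sample = ["the cat sat on the mat", "the dog sat"]
vocab_size, corpus_size = corpus_stats(sample)
print(vocab_size, corpus_size)  # prints "6 9"
```

Because the function accepts any iterable of strings, an open text file (for example one opened with `gzip.open(path, "rt", encoding="utf-8")`) can be passed in directly, so large corpora are processed line by line rather than loaded into memory at once.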