Westbury Lab Usenet Corpus: anonymized compilation of postings from 47,860 English-language newsgroups from 2005-2010 (40 GB)
Westbury Lab Wikipedia Corpus: Snapshot of all the articles in the English part of Wikipedia, taken in April 2010. It was processed, as described in detail below, to remove all links and irrelevant material (navigation text, etc.). The corpus is untagged, raw text. Used by Stanford NLP (1.8 GB).
: a corpus of manually constructed explanation graphs, explanatory role ratings, and an associated semistructured tablestore for most publicly available elementary science exam questions in the US (8 MB)
Wikipedia Extraction (WEX): a processed dump of English-language Wikipedia (66 GB)
Wikipedia XML Data: complete copy of all Wikimedia wikis, in the form of wikitext source and metadata embedded in XML. (500 GB)
Yahoo! Answers Comprehensive Questions and Answers: Yahoo! Answers corpus as of 10/25/2007. Contains 4,483,032 questions and their answers. (3.6 GB)
Yahoo! Answers consisting of questions asked in French: Subset of the Yahoo! Answers corpus from 2006 to 2015 consisting of 1.7 million questions posed in French, and their corresponding answers. (3.8 GB)
Yahoo! Answers Manner Questions: subset of the Yahoo! Answers corpus from a 10/25/2007 dump, selected for their linguistic properties. Contains 142,627 questions and their answers. (104 MB)
Yahoo! HTML Forms Extracted from Publicly Available Webpages: a small sample of pages that contain complex HTML forms; contains 2.67 million complex forms. (50+ GB)
Yahoo N-Gram Representations: This dataset contains n-gram representations. The data may serve as a testbed for the query rewriting task, a common problem in IR research, as well as for the word and phrase similarity task, which is common in NLP research. (2.6 GB)
Yahoo! N-Grams, version 2.0: n-grams (n = 1 to 5), extracted from a corpus of 14.6 million documents (126 million unique sentences, 3.4 billion running words) crawled from over 12000 news-oriented sites (12 GB)
Yahoo! Search Logs with Relevance Judgments: Anonymized Yahoo! Search Logs with Relevance Judgments (1.3 GB)
Yahoo! Semantically Annotated Snapshot of the English Wikipedia: English Wikipedia dated from 2006-11-04 processed with a number of publicly available NLP tools. 1,490,688 entries. (6 GB)
Yelp: including restaurant rankings and 2.2M reviews (on request)
Youtube: information on 1.7 million YouTube videos (torrent)
- Awesome public datasets/NLP (includes more lists)
- AWS Public Datasets
- CrowdFlower: Data for Everyone (lots of small surveys they conducted and data obtained by crowdsourcing for a specific task)
- Kaggle 1, 2 (make sure, though, that the Kaggle competition data can be used outside of the competition!)
- Open Library
- Quora (mainly annotated corpora)
- /r/datasets (endless list of datasets; most are scraped by amateurs, though, and not properly documented or licensed)
- Rs.io (another big list)
- Stackexchange: Opendata
- Stanford NLP group (mainly annotated corpora and TreeBanks or actual NLP tools)
- Yahoo! Webscope (also includes papers that use the data that is provided)
- SaudiNewsNet: 31,030 Arabic newspaper articles along with metadata, extracted from various online Saudi newspapers. (2 MB)
- Collection of Urdu datasets for POS, NER and NLP tasks.
German Political Speeches Corpus: collection of recent speeches held by top German representatives (25 MB, 11 MTokens)
NEGRA: A Syntactically Annotated Corpus of German Newspaper Texts. Available for free for all universities and non-profit organizations. Need to sign and send a form to obtain. (on request)
Ten Thousand German News Articles Dataset: 10273 German-language news articles categorized into nine classes for topic classification. (26.1 MB)
100k German Court Decisions: Open Legal Data releases a dataset of 100,000 German court decisions and 444,000 citations (772 MB)