package documentation

NLTK corpus readers. The modules in this package provide functions that can be used to read corpus files in a variety of formats. These functions can be used to read both the corpus files that are distributed in the NLTK corpus package, and corpus files that are part of external corpora.

Available Corpora

Please see http://www.nltk.org/nltk_data/ for a complete list. Install corpora using nltk.download().
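
As a minimal sketch of the install step, the helper below downloads a corpus only when it is not already present. The name ensure_corpus is hypothetical (it is not part of NLTK), and the "corpora/<name>" resource path is an assumption that holds for corpora such as brown but not for every resource in the NLTK data collection.

```python
def ensure_corpus(name):
    """Download the named corpus only if it is not already installed.

    Hypothetical helper, not part of NLTK. Assumes the corpus lives
    under the "corpora/" directory of the NLTK data path.
    """
    # Imported lazily so this helper can be defined without NLTK installed.
    import nltk

    try:
        # nltk.data.find raises LookupError when the resource is missing.
        nltk.data.find(f"corpora/{name}")
        return True
    except LookupError:
        # nltk.download returns True on success and False otherwise.
        return nltk.download(name)
```

Calling ensure_corpus("brown") before importing nltk.corpus.brown would then avoid a LookupError on first use, provided the machine has network access.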

Corpus Reader Functions

Each corpus module defines one or more "corpus reader functions", which can be used to read documents from that corpus. These functions take an argument, item, which is used to indicate which document should be read from the corpus:

  • If item is one of the unique identifiers listed in the corpus module's items variable, then the corresponding document will be loaded from the NLTK corpus package.
  • If item is a filename, then that file will be read.

Additionally, corpus reader functions can be given lists of item names, in which case they return the concatenation of the corresponding documents.
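
The lookup rules above can be illustrated with a toy reader function. This is only a sketch of the described behaviour, not NLTK's implementation: the ITEMS dictionary is a hypothetical stand-in for a corpus module's items variable, and splitting on whitespace stands in for real tokenization.

```python
# Hypothetical stand-in for a corpus module's ``items`` variable:
# identifier -> document text. A real corpus module would load these
# documents from the NLTK corpus package instead.
ITEMS = {"a": "the cat sat", "b": "on the mat"}


def words(item):
    """Toy reader function: return the words of one or more documents."""
    # A list of item names yields the concatenation of the documents.
    if isinstance(item, (list, tuple)):
        result = []
        for one in item:
            result.extend(words(one))
        return result
    if item in ITEMS:
        # Known identifier: use the packaged document.
        text = ITEMS[item]
    else:
        # Anything else is treated as a filename and read from disk.
        with open(item, encoding="utf-8") as f:
            text = f.read()
    # Whitespace splitting is a placeholder for real tokenization.
    return text.split()


print(words("a"))         # ['the', 'cat', 'sat']
print(words(["a", "b"]))  # ['the', 'cat', 'sat', 'on', 'the', 'mat']
```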

Corpus reader functions are named based on the type of information they return. Some common examples, and their return types, are:

  • words(): list of str
  • sents(): list of (list of str)
  • paras(): list of (list of (list of str))
  • tagged_words(): list of (str,str) tuples
  • tagged_sents(): list of (list of (str,str))
  • tagged_paras(): list of (list of (list of (str,str)))
  • chunked_sents(): list of (Tree with (str,str) leaves)
  • parsed_sents(): list of (Tree with str leaves)
  • parsed_paras(): list of (list of (Tree with str leaves))
  • xml(): a single XML ElementTree
  • raw(): unprocessed corpus contents
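
To make the nesting of these return types concrete, the sketch below builds toy data in the shape of paras() and flattens it into the shapes of sents() and words(). The sentences are invented for illustration and plain Python lists are used throughout; the values returned by actual corpus readers may be lazy, list-like views rather than plain lists.

```python
# Toy data shaped like the return value of paras():
# list of paragraphs, each a list of sentences, each a list of str.
paras = [
    [["The", "cat", "sat", "."], ["It", "purred", "."]],
    [["Dogs", "bark", "."]],
]

# Dropping the paragraph level gives the shape of sents().
sents = [sent for para in paras for sent in para]

# Dropping the sentence level as well gives the shape of words().
words = [word for sent in sents for word in sent]

print(len(paras), len(sents), len(words))  # 2 3 10
print(words[:4])                           # ['The', 'cat', 'sat', '.']
```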

For example, to read a list of the words in the Brown Corpus, use nltk.corpus.brown.words():

>>> from nltk.corpus import brown
>>> print(", ".join(brown.words()))
The, Fulton, County, Grand, Jury, said, ...

Module europarl_raw Undocumented
Package reader NLTK corpus readers. The modules in this package provide functions that can be used to read corpus fileids in a variety of formats. These functions can be used to read both the corpus fileids that are distributed in the NLTK corpus package, and corpus fileids that are part of external corpora.
Module util No module docstring; 0/1 constant, 1/1 function, 1/1 class documented

From __init__.py:

Function demo Undocumented
Variable abc Undocumented
Variable alpino Undocumented
Variable brown Undocumented
Variable cess_cat Undocumented
Variable cess_esp Undocumented
Variable cmudict Undocumented
Variable comparative_sentences Undocumented
Variable comtrans Undocumented
Variable conll2000 Undocumented
Variable conll2002 Undocumented
Variable conll2007 Undocumented
Variable crubadan Undocumented
Variable dependency_treebank Undocumented
Variable floresta Undocumented
Variable framenet Undocumented
Variable framenet15 Undocumented
Variable gazetteers Undocumented
Variable genesis Undocumented
Variable gutenberg Undocumented
Variable ieer Undocumented
Variable inaugural Undocumented
Variable indian Undocumented
Variable jeita Undocumented
Variable knbc Undocumented
Variable lin_thesaurus Undocumented
Variable mac_morpho Undocumented
Variable machado Undocumented
Variable masc_tagged Undocumented
Variable movie_reviews Undocumented
Variable multext_east Undocumented
Variable names Undocumented
Variable nombank Undocumented
Variable nombank_ptb Undocumented
Variable nonbreaking_prefixes Undocumented
Variable nps_chat Undocumented
Variable opinion_lexicon Undocumented
Variable perluniprops Undocumented
Variable ppattach Undocumented
Variable product_reviews_1 Undocumented
Variable product_reviews_2 Undocumented
Variable propbank Undocumented
Variable propbank_ptb Undocumented
Variable pros_cons Undocumented
Variable ptb Undocumented
Variable qc Undocumented
Variable reuters Undocumented
Variable rte Undocumented
Variable semcor Undocumented
Variable senseval Undocumented
Variable sentence_polarity Undocumented
Variable sentiwordnet Undocumented
Variable shakespeare Undocumented
Variable sinica_treebank Undocumented
Variable state_union Undocumented
Variable stopwords Undocumented
Variable subjectivity Undocumented
Variable swadesh Undocumented
Variable swadesh110 Undocumented
Variable swadesh207 Undocumented
Variable switchboard Undocumented
Variable timit Undocumented
Variable timit_tagged Undocumented
Variable toolbox Undocumented
Variable treebank Undocumented
Variable treebank_chunk Undocumented
Variable treebank_raw Undocumented
Variable twitter_samples Undocumented
Variable udhr Undocumented
Variable udhr2 Undocumented
Variable universal_treebanks Undocumented
Variable verbnet Undocumented
Variable webtext Undocumented
Variable wordnet Undocumented
Variable wordnet_ic Undocumented
Variable words Undocumented