nltk.corpus.reader

package documentation


NLTK corpus readers. The modules in this package provide functions that can be used to read corpus files in a variety of formats. These functions can read both the corpus files distributed in the NLTK corpus package and corpus files that are part of external corpora.

Corpus Reader Functions

Each corpus module defines one or more "corpus reader functions", which can be used to read documents from that corpus. These functions take an argument, item, which is used to indicate which document should be read from the corpus:

If item is one of the unique identifiers listed in the corpus module's items variable, then the corresponding document will be loaded from the NLTK corpus package.
If item is a fileid, then that file will be read.

Additionally, corpus reader functions can be given lists of item names, in which case they return a concatenation of the corresponding documents.
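The concatenation behaviour can be sketched with a throwaway two-file corpus. This is a minimal sketch, not a definitive recipe: the file names `a.txt` and `b.txt` and their contents are made up for illustration, and it assumes a current NLTK release, where the argument described above as item is named fileids.

```python
import os
import tempfile

from nltk.corpus.reader import PlaintextCorpusReader

# Hypothetical two-document corpus written to a temporary directory.
DOCUMENTS = [("a.txt", "First file."), ("b.txt", "Second file.")]

with tempfile.TemporaryDirectory() as root:
    for name, text in DOCUMENTS:
        with open(os.path.join(root, name), "w", encoding="utf8") as handle:
            handle.write(text)

    # The second argument is a regexp selecting the corpus files under root.
    reader = PlaintextCorpusReader(root, r".*\.txt")

    # Materialise the lazy corpus views while the files still exist.
    words_a = list(reader.words("a.txt"))
    words_b = list(reader.words("b.txt"))
    words_both = list(reader.words(["a.txt", "b.txt"]))

# Passing a list of file identifiers yields the documents back to back.
print(words_both)
```

The default word tokenizer splits off punctuation, so each sentence-final period becomes its own token.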

Corpus reader functions are named based on the type of information they return. Some common examples, and their return types, are:

words(): list of str
sents(): list of (list of str)
paras(): list of (list of (list of str))
tagged_words(): list of (str,str) tuple
tagged_sents(): list of (list of (str,str))
tagged_paras(): list of (list of (list of (str,str)))
chunked_sents(): list of (Tree w/ (str,str) leaves)
parsed_sents(): list of (Tree with str leaves)
parsed_paras(): list of (list of (Tree with str leaves))
xml(): A single xml ElementTree
raw(): unprocessed corpus contents
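The nesting of these return types can be seen on a tiny plaintext corpus. The sketch below is illustrative only: the file `demo.txt` and its text are invented, and a `LineTokenizer` is passed as the sentence tokenizer (one sentence per line) purely so that the example does not depend on downloaded tokenizer models.

```python
import os
import tempfile

from nltk.corpus.reader import PlaintextCorpusReader
from nltk.tokenize import LineTokenizer

# Two paragraphs separated by a blank line; one sentence per line.
TEXT = "The dog barked.\nIt ran.\n\nThe cat slept.\n"

with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "demo.txt"), "w", encoding="utf8") as handle:
        handle.write(TEXT)

    # Assumption for this sketch: treat every line as one sentence.
    reader = PlaintextCorpusReader(
        root, ["demo.txt"], sent_tokenizer=LineTokenizer()
    )

    words = list(reader.words())   # list of str
    sents = list(reader.sents())   # list of (list of str)
    paras = list(reader.paras())   # list of (list of (list of str))
    raw = reader.raw()             # unprocessed corpus contents

print(len(words), len(sents), len(paras))
```

Each level wraps the previous one: a paragraph is a list of sentences, and a sentence is a list of word strings.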

For example, to read a list of the words in the Brown Corpus, use nltk.corpus.brown.words():

>>> from nltk.corpus import brown
>>> print(", ".join(brown.words()))
The, Fulton, County, Grand, Jury, said, ...

Module	`aligned`	No module docstring; 1/1 class documented
Module	`api`	API for corpus readers.
Module	`bnc`	Corpus reader for the XML version of the British National Corpus.
Module	`bracket_parse`	Corpus reader for corpora that consist of parenthesis-delineated parse trees.
Module	`categorized_sents`	CorpusReader structured for corpora that contain one instance on each row. This CorpusReader is specifically used for the Subjectivity Dataset and the Sentence Polarity Dataset.
Module	`chasen`	No module docstring; 0/2 function, 1/1 class documented
Module	`childes`	Corpus reader for the XML version of the CHILDES corpus.
Module	`chunked`	A reader for corpora that contain chunked (and optionally tagged) documents.
Module	`cmudict`	The Carnegie Mellon Pronouncing Dictionary [cmudict.0.6] ftp://ftp.cs.cmu.edu/project/speech/dict/ Copyright 1998 Carnegie Mellon University
Module	`comparative_sents`	CorpusReader for the Comparative Sentence Dataset.
Module	`conll`	Read CoNLL-style chunk files.
Module	`crubadan`	An NLTK interface for the n-gram statistics gathered from the corpora for each language using An Crubadan.
Module	`dependency`	Undocumented
Module	`framenet`	Corpus reader for the FrameNet 1.7 lexicon and corpus.
Module	`ieer`	Corpus reader for the Information Extraction and Entity Recognition Corpus.
Module	`indian`	Indian Language POS-Tagged Corpus Collected by A Kumaran, Microsoft Research, India Distributed with permission
Module	`ipipan`	Undocumented
Module	`knbc`	Undocumented
Module	`lin`	Undocumented
Module	`mte`	A reader for corpora whose documents are in MTE format.
Module	`nkjp`	No module docstring; 1/1 function, 4/5 classes documented
Module	`nombank`	No module docstring; 2/5 classes documented
Module	`nps_chat`	Undocumented
Module	`opinion_lexicon`	CorpusReader for the Opinion Lexicon.
Module	`panlex_lite`	CorpusReader for PanLex Lite, a stripped down version of PanLex distributed as an SQLite database. See the README.txt in the panlex_lite corpus directory for more information on PanLex Lite.
Module	`panlex_swadesh`	Undocumented
Module	`pl196x`	Undocumented
Module	`plaintext`	A reader for corpora that consist of plaintext documents.
Module	`ppattach`	Read lines from the Prepositional Phrase Attachment Corpus.
Module	`propbank`	No module docstring; 2/6 classes documented
Module	`pros_cons`	CorpusReader for the Pros and Cons dataset.
Module	`reviews`	CorpusReader for reviews corpora (syntax based on Customer Review Corpus).
Module	`rte`	Corpus reader for the Recognizing Textual Entailment (RTE) Challenge Corpora.
Module	`semcor`	Corpus reader for the SemCor Corpus.
Module	`senseval`	Read from the Senseval 2 Corpus.
Module	`sentiwordnet`	An NLTK interface for SentiWordNet
Module	`sinica_treebank`	Sinica Treebank Corpus Sample
Module	`string_category`	Read tuples from a corpus consisting of categorized strings. For example, from the question classification corpus:
Module	`switchboard`	No module docstring; 1/1 class documented
Module	`tagged`	A reader for corpora whose documents contain part-of-speech-tagged words.
Module	`timit`	Read tokens, phonemes and audio data from the NLTK TIMIT Corpus.
Module	`toolbox`	Module for reading, writing and manipulating Toolbox databases and settings files.
Module	`twitter`	A reader for corpora that consist of Tweets. It is assumed that the Tweets have been serialised into line-delimited JSON.
Module	`udhr`	UDHR corpus reader. It mostly deals with encodings.
Module	`util`	No module docstring; 4/11 functions, 3/3 classes documented
Module	`verbnet`	An NLTK interface to the VerbNet verb lexicon
Module	`wordlist`	Undocumented
Module	`wordnet`	An NLTK interface for WordNet
Module	`xmldocs`	Corpus reader for corpora whose documents are xml files.
Module	`ycoe`	Corpus reader for the York-Toronto-Helsinki Parsed Corpus of Old English Prose (YCOE), a 1.5 million word syntactically-annotated corpus of Old English prose texts. The corpus is distributed by the Oxford Text Archive: ...

From __init__.py:

Class	`AlignedCorpusReader`	Reader for corpora of word-aligned sentences. Tokens are assumed to be separated by whitespace. Sentences begin on separate lines.
Class	`AlpinoCorpusReader`	Reader for the Alpino Dutch Treebank. This corpus has a lexical breakdown structure embedded, as read by _parse. Unfortunately this puts punctuation and some other words out of the sentence order in the xml element tree...
Class	`BNCCorpusReader`	Corpus reader for the XML version of the British National Corpus.
Class	`BracketParseCorpusReader`	Reader for corpora that consist of parenthesis-delineated parse trees, like those found in the "combined" section of the Penn Treebank, e.g. "(S (NP (DT the) (JJ little) (NN dog)) (VP (VBD barked)))".
Class	`CategorizedBracketParseCorpusReader`	A reader for parsed corpora whose documents are divided into categories based on their file identifiers. @author: Nathan Schneider <nschneid@cs.cmu.edu>
Class	`CategorizedCorpusReader`	A mixin class used to aid in the implementation of corpus readers for categorized corpora. This class defines the method `categories()`, which returns a list of the categories for the corpus or for a specified set of fileids; and overrides ...
Class	`CategorizedPlaintextCorpusReader`	A reader for plaintext corpora whose documents are divided into categories based on their file identifiers.
Class	`CategorizedSentencesCorpusReader`	A reader for corpora in which each row represents a single instance, mainly a sentence. Instances are divided into categories based on their file identifiers (see CategorizedCorpusReader). Since many corpora allow rows that contain more than one sentence, it is possible to specify a sentence tokenizer to retrieve all sentences instead of all rows.
Class	`CategorizedTaggedCorpusReader`	A reader for part-of-speech tagged corpora whose documents are divided into categories based on their file identifiers.
Class	`ChasenCorpusReader`	Undocumented
Class	`CHILDESCorpusReader`	Corpus reader for the XML version of the CHILDES corpus. The CHILDES corpus is available at ``https://childes.talkbank.org/``. The XML version of CHILDES is located at ``https://childes.talkbank.org/data-xml/``...
Class	`ChunkedCorpusReader`	Reader for chunked (and optionally tagged) corpora. Paragraphs are split using a block reader. They are then tokenized into sentences using a sentence tokenizer. Finally, these sentences are parsed into chunk trees using a string-to-chunktree conversion function...
Class	`CMUDictCorpusReader`	No class docstring; 4/4 methods documented
Class	`ComparativeSentencesCorpusReader`	Reader for the Comparative Sentence Dataset by Jindal and Liu (2006).
Class	`ConllChunkCorpusReader`	A ConllCorpusReader whose data file contains three columns: words, pos, and chunk.
Class	`ConllCorpusReader`	A corpus reader for CoNLL-style files. These files consist of a series of sentences, separated by blank lines. Each sentence is encoded using a table (or "grid") of values, where each line corresponds to a single word, and each column corresponds to an annotation type...
Class	`CorpusReader`	A base class for "corpus reader" classes, each of which can be used to read a specific corpus format. Each individual corpus reader instance is used to read a specific corpus, consisting of one or more files under a common root directory...
Class	`CrubadanCorpusReader`	A corpus reader used to access language An Crubadan n-gram files.
Class	`DependencyCorpusReader`	No class docstring; 1/7 method documented
Class	`EuroparlCorpusReader`	Reader for Europarl corpora that consist of plaintext documents. Documents are divided into chapters instead of paragraphs as for regular plaintext documents. Chapters are separated using blank lines. ...
Class	`FramenetCorpusReader`	A corpus reader for the Framenet Corpus.
Class	`IEERCorpusReader`	No summary
Class	`IndianCorpusReader`	List of words, one per line. Blank lines are ignored.
Class	`IPIPANCorpusReader`	Corpus reader designed to work with corpus created by IPI PAN. See http://korpus.pl/en/ for more details about IPI PAN corpus.
Class	`KNBCorpusReader`	`__init__`, which specifies the location of the corpus and a method for detecting the sentence blocks in corpus files.
Class	`LinThesaurusCorpusReader`	Wrapper for the LISP-formatted thesauruses distributed by Dekang Lin.
Class	`MacMorphoCorpusReader`	A corpus reader for the MAC_MORPHO corpus. Each line contains a single tagged word, using '_' as a separator. Sentence boundaries are based on the end-sentence tag ('_.'). Paragraph information is not included in the corpus, so each paragraph returned by ...
Class	`MTECorpusReader`	Reader for corpora following the TEI-p5 xml scheme, such as MULTEXT-East. MULTEXT-East contains part-of-speech-tagged words with a quite precise tagging scheme. These tags can be converted to the Universal tagset...
Class	`MWAPPDBCorpusReader`	This class is used to read the list of word pairs from the subset of lexical pairs of The Paraphrase Database (PPDB) XXXL used in the Monolingual Word Alignment (MWA) algorithm described in Sultan et al...
Class	`NKJPCorpusReader`	No class docstring; 0/1 instance variable, 0/4 constant, 9/10 methods documented
Class	`NombankCorpusReader`	Corpus reader for the nombank corpus, which augments the Penn Treebank with information about the predicate argument structure of every noun instance. The corpus consists of two parts: the predicate-argument annotations themselves, and a set of "frameset files" which define the argument labels used by the annotations, on a per-noun basis...
Class	`NonbreakingPrefixesCorpusReader`	This is a class to read the nonbreaking prefixes textfiles from the Moses Machine Translation toolkit. These lists are used in the Python port of the Moses' word tokenizer.
Class	`NPSChatCorpusReader`	Undocumented
Class	`OpinionLexiconCorpusReader`	Reader for Liu and Hu opinion lexicon. Blank lines and readme are ignored.
Class	`PanLexLiteCorpusReader`	No class docstring; 0/3 instance variable, 0/2 constant, 3/4 methods documented
Class	`PanlexSwadeshCorpusReader`	This is a class to read the PanLex Swadesh list from
Class	`Pl196xCorpusReader`	No class docstring; 0/3 instance variable, 0/1 class variable, 1/14 method documented
Class	`PlaintextCorpusReader`	Reader for corpora that consist of plaintext documents. Paragraphs are assumed to be split using blank lines. Sentences and words can be tokenized using the default tokenizers, or by custom tokenizers specified as parameters to the constructor.
Class	`PortugueseCategorizedPlaintextCorpusReader`	Undocumented
Class	`PPAttachmentCorpusReader`	sentence_id verb noun1 preposition noun2 attachment
Class	`PropbankCorpusReader`	Corpus reader for the propbank corpus, which augments the Penn Treebank with information about the predicate argument structure of every verb instance. The corpus consists of two parts: the predicate-argument annotations themselves, and a set of "frameset files" which define the argument labels used by the annotations, on a per-verb basis...
Class	`ProsConsCorpusReader`	Reader for the Pros and Cons sentence dataset.
Class	`ReviewsCorpusReader`	Reader for the Customer Review Data dataset by Hu, Liu (2004). Note: we are not applying any sentence tokenization at the moment, just word tokenization.
Class	`RTECorpusReader`	Corpus reader for corpora in RTE challenges.
Class	`SemcorCorpusReader`	Corpus reader for the SemCor Corpus. For access to the complete XML data structure, use the ``xml()`` method. For access to simple word lists and tagged word lists, use ``words()``, ``sents()``, ``tagged_words()``, and ``tagged_sents()``.
Class	`SensevalCorpusReader`	No class docstring; 1/3 method documented
Class	`SentiSynset`	No class docstring; 0/4 instance variable, 1/6 method documented
Class	`SentiWordNetCorpusReader`	No class docstring; 0/1 instance variable, 1/5 method documented
Class	`SinicaTreebankCorpusReader`	Reader for the sinica treebank.
Class	`StringCategoryCorpusReader`	No class docstring; 0/1 instance variable, 2/4 methods documented
Class	`SwadeshCorpusReader`	No class docstring; 1/1 method documented
Class	`SwitchboardCorpusReader`	Undocumented
Class	`SyntaxCorpusReader`	An abstract base class for reading corpora consisting of syntactically parsed text. Subclasses should define:
Class	`TaggedCorpusReader`	Reader for simple part-of-speech tagged corpora. Paragraphs are assumed to be split using blank lines. Sentences and words can be tokenized using the default tokenizers, or by custom tokenizers specified as parameters to the constructor...
Class	`TEICorpusView`	Undocumented
Class	`TimitCorpusReader`	Reader for the TIMIT corpus (or any other corpus with the same file layout and use of file formats). The corpus root directory should contain the following files:
Class	`TimitTaggedCorpusReader`	A corpus reader for tagged sentences that are included in the TIMIT corpus.
Class	`ToolboxCorpusReader`	Undocumented
Class	`TwitterCorpusReader`	Reader for corpora that consist of Tweets represented as a list of line-delimited JSON.
Class	`UdhrCorpusReader`	Undocumented
Class	`UnicharsCorpusReader`	This class is used to read lists of characters from the Perl Unicode Properties (see http://perldoc.perl.org/perluniprops.html). The files in the perluniprop.zip are extracted using the Unicode::Tussle module from http://search...
Class	`VerbnetCorpusReader`	An NLTK interface to the VerbNet verb lexicon.
Class	`WordListCorpusReader`	List of words, one per line. Blank lines are ignored.
Class	`WordNetCorpusReader`	A corpus reader used to access wordnet or its variants.
Class	`WordNetICCorpusReader`	A corpus reader for the WordNet information content corpus.
Class	`XMLCorpusReader`	Corpus reader for corpora whose documents are xml files.
Class	`YCOECorpusReader`	Corpus reader for the York-Toronto-Helsinki Parsed Corpus of Old English Prose (YCOE), a 1.5 million word syntactically-annotated corpus of Old English prose texts.
Function	`find_corpus_fileids`	Undocumented
Function	`tagged_treebank_para_block_reader`	Undocumented
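As an illustration of one of the readers listed above, the following minimal sketch feeds the example tree from the `BracketParseCorpusReader` summary through that reader. The file name `sample.mrg` is hypothetical, and the corpus is a single-tree temporary file rather than a real treebank.

```python
import os
import tempfile

from nltk.corpus.reader import BracketParseCorpusReader

# The parenthesis-delineated tree quoted in the class summary above.
TREE_TEXT = "(S (NP (DT the) (JJ little) (NN dog)) (VP (VBD barked)))\n"

with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "sample.mrg"), "w", encoding="utf8") as handle:
        handle.write(TREE_TEXT)

    reader = BracketParseCorpusReader(root, r".*\.mrg")
    # parsed_sents() returns Tree objects; copy them out of the lazy view.
    trees = list(reader.parsed_sents())

first_tree = trees[0]
print(first_tree.label())
print(first_tree.leaves())
```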

def find_corpus_fileids(root, regexp):

Undocumented
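Since this function carries no docstring, the following is only a sketch of how it appears to behave: it assumes that root must be a path pointer object such as `nltk.data.FileSystemPathPointer` (not a plain string), and that the result is a sorted list of paths, relative to root, that match regexp. The directory contents are made up for the example.

```python
import os
import tempfile

from nltk.corpus.reader.util import find_corpus_fileids
from nltk.data import FileSystemPathPointer

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical corpus directory: two text files and one unrelated file.
    for name in ("b.txt", "a.txt", "notes.md"):
        with open(os.path.join(tmp, name), "w", encoding="utf8") as handle:
            handle.write("placeholder\n")

    # Wrap the directory in a path pointer before searching it.
    root = FileSystemPathPointer(tmp)
    text_fileids = find_corpus_fileids(root, r".*\.txt")

print(text_fileids)
```

Corpus reader constructors accept the same kind of regexp in place of an explicit list of file identifiers.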

def tagged_treebank_para_block_reader(stream):

Undocumented
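This function is also undocumented, so the sketch below rests on an assumption about its behaviour: each call reads lines from stream up to the next divider line made of `=` characters (as used in the tagged Penn Treebank files) and returns the accumulated paragraph as a one-element list, or an empty list at end of input. The tagged text is invented for illustration.

```python
import io

from nltk.corpus.reader.util import tagged_treebank_para_block_reader

# Two hypothetical paragraphs separated by a row of '=' characters.
DIVIDER = "=" * 38
stream = io.StringIO("A/DT dog/NN\n" + DIVIDER + "\nbarked/VBD\n")

first_block = tagged_treebank_para_block_reader(stream)
second_block = tagged_treebank_para_block_reader(stream)
end_block = tagged_treebank_para_block_reader(stream)

print(first_block, second_block, end_block)
```

Block readers of this shape are what corpus readers accept through their `para_block_reader` constructor parameter.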