package documentation

NLTK corpus readers. The modules in this package provide functions for reading corpus files in a variety of formats. These functions can read both the corpus files distributed in the NLTK corpus package and corpus files that belong to external corpora.

Corpus Reader Functions

Each corpus module defines one or more "corpus reader functions", which can be used to read documents from that corpus. These functions take an argument, item, which is used to indicate which document should be read from the corpus:

  • If item is one of the unique identifiers listed in the corpus module's items variable, then the corresponding document will be loaded from the NLTK corpus package.
  • If item is a fileid, then that file will be read.

Additionally, corpus reader functions can be given a list of item names, in which case they return the concatenation of the corresponding documents.
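
The single-item and list-of-items behaviour described above can be pictured with a small stand-alone sketch. It does not use NLTK: TOY_DOCS and read_words are hypothetical names invented for illustration, and whitespace splitting stands in for a real tokenizer.

```python
# Toy illustration only: TOY_DOCS and read_words are hypothetical names,
# not part of NLTK. Documents are tokenized by splitting on whitespace.
TOY_DOCS = {
    "a": "The dog barked .",
    "b": "The cat slept .",
}


def read_words(items):
    """Return the words of one document, or of several documents concatenated."""
    if isinstance(items, str):
        items = [items]  # a single identifier is treated as a one-element list
    words = []
    for item in items:
        words.extend(TOY_DOCS[item].split())
    return words


single = read_words("a")
combined = read_words(["a", "b"])
print(single)
print(combined)
```

Passing a list yields the same result as reading each document separately and joining the results in order.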

Corpus reader functions are named based on the type of information they return. Some common examples, and their return types, are:

  • words(): list of str
  • sents(): list of (list of str)
  • paras(): list of (list of (list of str))
  • tagged_words(): list of (str,str) tuple
  • tagged_sents(): list of (list of (str,str))
  • tagged_paras(): list of (list of (list of (str,str)))
  • chunked_sents(): list of (Tree with (str,str) leaves)
  • parsed_sents(): list of (Tree with str leaves)
  • parsed_paras(): list of (list of (Tree with str leaves))
  • xml(): A single xml ElementTree
  • raw(): unprocessed corpus contents
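
To make the nesting of these return types concrete, here is a minimal sketch that builds word, sentence, and paragraph lists from a short string. It is not how NLTK's readers work internally: TEXT and the toy_* helpers are hypothetical, paragraphs are assumed to be separated by blank lines, and a "." token is assumed to end a sentence.

```python
# Toy illustration of the nesting only; NLTK's readers use real tokenizers
# and corpus files. TEXT and the toy_* helpers below are hypothetical.
TEXT = "The dog barked . It was loud .\n\nThe cat slept ."


def toy_paras(text):
    """list of (list of (list of str)): paragraphs > sentences > words."""
    paras = []
    for block in text.split("\n\n"):  # a blank line separates paragraphs
        sents, current = [], []
        for token in block.split():  # whitespace tokenization
            current.append(token)
            if token == ".":  # a "." token closes the sentence
                sents.append(current)
                current = []
        if current:  # keep a trailing sentence with no final "."
            sents.append(current)
        paras.append(sents)
    return paras


def toy_sents(text):
    """list of (list of str): the paragraph level flattened away."""
    return [sent for para in toy_paras(text) for sent in para]


def toy_words(text):
    """list of str: the sentence level flattened away as well."""
    return [word for sent in toy_sents(text) for word in sent]


print(toy_words(TEXT)[:4])
print(toy_sents(TEXT)[1])
print(len(toy_paras(TEXT)))
```

Each level adds one layer of list nesting, which is the pattern the words(), sents(), and paras() entries describe.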

For example, to read a list of the words in the Brown Corpus, use nltk.corpus.brown.words():

>>> from nltk.corpus import brown
>>> print(", ".join(brown.words()))
The, Fulton, County, Grand, Jury, said, ...
Module aligned No module docstring; 1/1 class documented
Module api API for corpus readers.
Module bnc Corpus reader for the XML version of the British National Corpus.
Module bracket_parse Corpus reader for corpora that consist of parenthesis-delineated parse trees.
Module categorized_sents CorpusReader structured for corpora that contain one instance on each row. This CorpusReader is specifically used for the Subjectivity Dataset and the Sentence Polarity Dataset.
Module chasen No module docstring; 0/2 function, 1/1 class documented
Module childes Corpus reader for the XML version of the CHILDES corpus.
Module chunked A reader for corpora that contain chunked (and optionally tagged) documents.
Module cmudict The Carnegie Mellon Pronouncing Dictionary [cmudict.0.6] ftp://ftp.cs.cmu.edu/project/speech/dict/ Copyright 1998 Carnegie Mellon University
Module comparative_sents CorpusReader for the Comparative Sentence Dataset.
Module conll Read CoNLL-style chunk files.
Module crubadan An NLTK interface for the n-gram statistics gathered from the corpora for each language using An Crubadan.
Module dependency Undocumented
Module framenet Corpus reader for the FrameNet 1.7 lexicon and corpus.
Module ieer Corpus reader for the Information Extraction and Entity Recognition Corpus.
Module indian Indian Language POS-Tagged Corpus Collected by A Kumaran, Microsoft Research, India Distributed with permission
Module ipipan Undocumented
Module knbc Undocumented
Module lin Undocumented
Module mte A reader for corpora whose documents are in MTE format.
Module nkjp No module docstring; 1/1 function, 4/5 classes documented
Module nombank No module docstring; 2/5 classes documented
Module nps_chat Undocumented
Module opinion_lexicon CorpusReader for the Opinion Lexicon.
Module panlex_lite CorpusReader for PanLex Lite, a stripped down version of PanLex distributed as an SQLite database. See the README.txt in the panlex_lite corpus directory for more information on PanLex Lite.
Module panlex_swadesh Undocumented
Module pl196x Undocumented
Module plaintext A reader for corpora that consist of plaintext documents.
Module ppattach Read lines from the Prepositional Phrase Attachment Corpus.
Module propbank No module docstring; 2/6 classes documented
Module pros_cons CorpusReader for the Pros and Cons dataset.
Module reviews CorpusReader for reviews corpora (syntax based on Customer Review Corpus).
Module rte Corpus reader for the Recognizing Textual Entailment (RTE) Challenge Corpora.
Module semcor Corpus reader for the SemCor Corpus.
Module senseval Read from the Senseval 2 Corpus.
Module sentiwordnet An NLTK interface for SentiWordNet
Module sinica_treebank Sinica Treebank Corpus Sample
Module string_category Read tuples from a corpus consisting of categorized strings. For example, from the question classification corpus:
Module switchboard No module docstring; 1/1 class documented
Module tagged A reader for corpora whose documents contain part-of-speech-tagged words.
Module timit Read tokens, phonemes and audio data from the NLTK TIMIT Corpus.
Module toolbox Module for reading, writing and manipulating Toolbox databases and settings files.
Module twitter A reader for corpora that consist of Tweets. It is assumed that the Tweets have been serialised into line-delimited JSON.
Module udhr UDHR corpus reader. It mostly deals with encodings.
Module util No module docstring; 4/11 functions, 3/3 classes documented
Module verbnet An NLTK interface to the VerbNet verb lexicon
Module wordlist Undocumented
Module wordnet An NLTK interface for WordNet
Module xmldocs Corpus reader for corpora whose documents are xml files.
Module ycoe Corpus reader for the York-Toronto-Helsinki Parsed Corpus of Old English Prose (YCOE), a 1.5 million word syntactically-annotated corpus of Old English prose texts. The corpus is distributed by the Oxford Text Archive: ...

From __init__.py:

Class AlignedCorpusReader Reader for corpora of word-aligned sentences. Tokens are assumed to be separated by whitespace. Sentences begin on separate lines.
Class AlpinoCorpusReader Reader for the Alpino Dutch Treebank. This corpus has a lexical breakdown structure embedded, as read by _parse. Unfortunately, this puts punctuation and some other words out of the sentence order in the xml element tree...
Class BNCCorpusReader Corpus reader for the XML version of the British National Corpus.
Class BracketParseCorpusReader Reader for corpora that consist of parenthesis-delineated parse trees, like those found in the "combined" section of the Penn Treebank, e.g. "(S (NP (DT the) (JJ little) (NN dog)) (VP (VBD barked)))".
Class CategorizedBracketParseCorpusReader A reader for parsed corpora whose documents are divided into categories based on their file identifiers. @author: Nathan Schneider <nschneid@cs.cmu.edu>
Class CategorizedCorpusReader A mixin class used to aid in the implementation of corpus readers for categorized corpora. This class defines the method categories(), which returns a list of the categories for the corpus or for a specified set of fileids; and overrides ...
Class CategorizedPlaintextCorpusReader A reader for plaintext corpora whose documents are divided into categories based on their file identifiers.
Class CategorizedSentencesCorpusReader A reader for corpora in which each row represents a single instance, mainly a sentence. Instances are divided into categories based on their file identifiers (see CategorizedCorpusReader). Since many corpora allow rows that contain more than one sentence, it is possible to specify a sentence tokenizer to retrieve all sentences instead of all rows.
Class CategorizedTaggedCorpusReader A reader for part-of-speech tagged corpora whose documents are divided into categories based on their file identifiers.
Class ChasenCorpusReader Undocumented
Class CHILDESCorpusReader Corpus reader for the XML version of the CHILDES corpus. The CHILDES corpus is available at ``https://childes.talkbank.org/``. The XML version of CHILDES is located at ``https://childes.talkbank.org/data-xml/``...
Class ChunkedCorpusReader Reader for chunked (and optionally tagged) corpora. Paragraphs are split using a block reader. They are then tokenized into sentences using a sentence tokenizer. Finally, these sentences are parsed into chunk trees using a string-to-chunktree conversion function...
Class CMUDictCorpusReader No class docstring; 4/4 methods documented
Class ComparativeSentencesCorpusReader Reader for the Comparative Sentence Dataset by Jindal and Liu (2006).
Class ConllChunkCorpusReader A ConllCorpusReader whose data file contains three columns: words, pos, and chunk.
Class ConllCorpusReader A corpus reader for CoNLL-style files. These files consist of a series of sentences, separated by blank lines. Each sentence is encoded using a table (or "grid") of values, where each line corresponds to a single word, and each column corresponds to an annotation type...
Class CorpusReader A base class for "corpus reader" classes, each of which can be used to read a specific corpus format. Each individual corpus reader instance is used to read a specific corpus, consisting of one or more files under a common root directory...
Class CrubadanCorpusReader A corpus reader used to access language An Crubadan n-gram files.
Class DependencyCorpusReader No class docstring; 1/7 method documented
Class EuroparlCorpusReader Reader for Europarl corpora that consist of plaintext documents. Documents are divided into chapters instead of paragraphs as for regular plaintext documents. Chapters are separated using blank lines. ...
Class FramenetCorpusReader A corpus reader for the Framenet Corpus.
Class IEERCorpusReader No summary
Class IndianCorpusReader List of words, one per line. Blank lines are ignored.
Class IPIPANCorpusReader Corpus reader designed to work with the corpus created by IPI PAN. See http://korpus.pl/en/ for more details about the IPI PAN corpus.
Class KNBCorpusReader __init__, which specifies the location of the corpus and a method for detecting the sentence blocks in corpus files.
Class LinThesaurusCorpusReader Wrapper for the LISP-formatted thesauruses distributed by Dekang Lin.
Class MacMorphoCorpusReader A corpus reader for the MAC_MORPHO corpus. Each line contains a single tagged word, using '_' as a separator. Sentence boundaries are based on the end-sentence tag ('_.'). Paragraph information is not included in the corpus, so each paragraph returned by ...
Class MTECorpusReader Reader for corpora following the TEI-p5 xml scheme, such as MULTEXT-East. MULTEXT-East contains part-of-speech-tagged words with a quite precise tagging scheme. These tags can be converted to the Universal tagset...
Class MWAPPDBCorpusReader This class is used to read the list of word pairs from the subset of lexical pairs of The Paraphrase Database (PPDB) XXXL used in the Monolingual Word Alignment (MWA) algorithm described in Sultan et al...
Class NKJPCorpusReader No class docstring; 0/1 instance variable, 0/4 constant, 9/10 methods documented
Class NombankCorpusReader Corpus reader for the nombank corpus, which augments the Penn Treebank with information about the predicate argument structure of every noun instance. The corpus consists of two parts: the predicate-argument annotations themselves, and a set of "frameset files" which define the argument labels used by the annotations, on a per-noun basis...
Class NonbreakingPrefixesCorpusReader This is a class to read the nonbreaking prefixes text files from the Moses Machine Translation toolkit. These lists are used in the Python port of the Moses word tokenizer.
Class NPSChatCorpusReader Undocumented
Class OpinionLexiconCorpusReader Reader for Liu and Hu opinion lexicon. Blank lines and readme are ignored.
Class PanLexLiteCorpusReader No class docstring; 0/3 instance variable, 0/2 constant, 3/4 methods documented
Class PanlexSwadeshCorpusReader This is a class to read the PanLex Swadesh list from
Class Pl196xCorpusReader No class docstring; 0/3 instance variable, 0/1 class variable, 1/14 method documented
Class PlaintextCorpusReader Reader for corpora that consist of plaintext documents. Paragraphs are assumed to be split using blank lines. Sentences and words can be tokenized using the default tokenizers, or by custom tokenizers specified as parameters to the constructor.
Class PortugueseCategorizedPlaintextCorpusReader Undocumented
Class PPAttachmentCorpusReader sentence_id verb noun1 preposition noun2 attachment
Class PropbankCorpusReader Corpus reader for the propbank corpus, which augments the Penn Treebank with information about the predicate argument structure of every verb instance. The corpus consists of two parts: the predicate-argument annotations themselves, and a set of "frameset files" which define the argument labels used by the annotations, on a per-verb basis...
Class ProsConsCorpusReader Reader for the Pros and Cons sentence dataset.
Class ReviewsCorpusReader Reader for the Customer Review Data dataset by Hu and Liu (2004). Note: no sentence tokenization is applied at the moment, only word tokenization.
Class RTECorpusReader Corpus reader for corpora in RTE challenges.
Class SemcorCorpusReader Corpus reader for the SemCor Corpus. For access to the complete XML data structure, use the ``xml()`` method. For access to simple word lists and tagged word lists, use ``words()``, ``sents()``, ``tagged_words()``, and ``tagged_sents()``.
Class SensevalCorpusReader No class docstring; 1/3 method documented
Class SentiSynset No class docstring; 0/4 instance variable, 1/6 method documented
Class SentiWordNetCorpusReader No class docstring; 0/1 instance variable, 1/5 method documented
Class SinicaTreebankCorpusReader Reader for the sinica treebank.
Class StringCategoryCorpusReader No class docstring; 0/1 instance variable, 2/4 methods documented
Class SwadeshCorpusReader No class docstring; 1/1 method documented
Class SwitchboardCorpusReader Undocumented
Class SyntaxCorpusReader An abstract base class for reading corpora consisting of syntactically parsed text. Subclasses should define:
Class TaggedCorpusReader Reader for simple part-of-speech tagged corpora. Paragraphs are assumed to be split using blank lines. Sentences and words can be tokenized using the default tokenizers, or by custom tokenizers specified as parameters to the constructor...
Class TEICorpusView Undocumented
Class TimitCorpusReader Reader for the TIMIT corpus (or any other corpus with the same file layout and use of file formats). The corpus root directory should contain the following files:
Class TimitTaggedCorpusReader A corpus reader for tagged sentences that are included in the TIMIT corpus.
Class ToolboxCorpusReader Undocumented
Class TwitterCorpusReader Reader for corpora that consist of Tweets represented as a list of line-delimited JSON.
Class UdhrCorpusReader Undocumented
Class UnicharsCorpusReader This class is used to read lists of characters from the Perl Unicode Properties (see http://perldoc.perl.org/perluniprops.html). The files in the perluniprop.zip are extracted using the Unicode::Tussle module from http://search...
Class VerbnetCorpusReader An NLTK interface to the VerbNet verb lexicon.
Class WordListCorpusReader List of words, one per line. Blank lines are ignored.
Class WordNetCorpusReader A corpus reader used to access wordnet or its variants.
Class WordNetICCorpusReader A corpus reader for the WordNet information content corpus.
Class XMLCorpusReader Corpus reader for corpora whose documents are xml files.
Class YCOECorpusReader Corpus reader for the York-Toronto-Helsinki Parsed Corpus of Old English Prose (YCOE), a 1.5 million word syntactically-annotated corpus of Old English prose texts.
Function find_corpus_fileids Undocumented
Function tagged_treebank_para_block_reader Undocumented
def find_corpus_fileids(root, regexp):

Undocumented
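
Since the function is undocumented here, the following is only a sketch of one plausible reading of its signature: walk a root directory and return, in sorted order, the relative paths that match a regular expression in full. The name find_fileids_sketch is hypothetical, the sketch takes a plain directory string as root, and the real NLTK function may accept different root types and behave differently.

```python
import os
import re
import tempfile


def find_fileids_sketch(root, regexp):
    """Hypothetical stand-in, not NLTK's implementation.

    Returns sorted "/"-separated paths, relative to root, that fully
    match regexp.
    """
    pattern = re.compile(regexp)
    found = []
    for dirname, _subdirs, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirname, name)
            # Express the path relative to root, with "/" separators.
            fileid = os.path.relpath(full, root).replace(os.sep, "/")
            if pattern.fullmatch(fileid):
                found.append(fileid)
    return sorted(found)


# Usage on a throwaway directory tree.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "sub"))
    for rel in ("a.txt", "b.dat", os.path.join("sub", "c.txt")):
        with open(os.path.join(root, rel), "w") as fh:
            fh.write("x")
    fileids = find_fileids_sketch(root, r".*\.txt")

print(fileids)  # ['a.txt', 'sub/c.txt']
```

Matching against the root-relative path rather than the bare file name lets a pattern select files in subdirectories.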

def tagged_treebank_para_block_reader(stream):

Undocumented