NLTK corpus readers. The modules in this package provide functions that can be used to read corpus files in a variety of formats. These functions can be used to read both the corpus files that are distributed in the NLTK corpus package, and corpus files that are part of external corpora.
Corpus Reader Functions
Each corpus module defines one or more "corpus reader functions", which can be used to read documents from that corpus. These functions take an argument, item, which is used to indicate which document should be read from the corpus:
- If item is one of the unique identifiers listed in the corpus module's items variable, then the corresponding document will be loaded from the NLTK corpus package.
- If item is a fileid, then that file will be read.
Additionally, corpus reader functions can be given lists of item names, in which case they return a concatenation of the corresponding documents.
Corpus reader functions are named based on the type of information they return. Some common examples, and their return types, are:
- words(): list of str
- sents(): list of (list of str)
- paras(): list of (list of (list of str))
- tagged_words(): list of (str,str) tuple
- tagged_sents(): list of (list of (str,str))
- tagged_paras(): list of (list of (list of (str,str)))
- chunked_sents(): list of (Tree with (str,str) leaves)
- parsed_sents(): list of (Tree with str leaves)
- parsed_paras(): list of (list of (Tree with str leaves))
- xml(): a single XML ElementTree
- raw(): unprocessed corpus contents
For example, to read a list of the words in the Brown Corpus, use nltk.corpus.brown.words():
>>> from nltk.corpus import brown
>>> print(", ".join(brown.words()))
The, Fulton, County, Grand, Jury, said, ...
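The nesting in the return types listed above can be illustrated without any corpus data. The following is a minimal sketch, not NLTK code: ToyPlaintextReader is a hypothetical stand-in class that assumes paragraphs are separated by blank lines, each line holds one sentence, and words are separated by whitespace. It only mirrors the shapes returned by raw(), paras(), sents(), and words().

```python
# Hypothetical stand-in for a corpus reader; this class is NOT part of NLTK.
class ToyPlaintextReader:
    """Mimics the return shapes of raw(), paras(), sents(), and words()."""

    def __init__(self, text):
        self._text = text

    def raw(self):
        # Unprocessed corpus contents (a single str).
        return self._text

    def paras(self):
        # list of (list of (list of str)):
        # paragraphs -> sentences -> words.
        # Assumption: blank line = paragraph break, one sentence per line.
        result = []
        for block in self._text.split("\n\n"):
            sents = [line.split() for line in block.splitlines() if line.strip()]
            if sents:
                result.append(sents)
        return result

    def sents(self):
        # list of (list of str): paragraphs flattened into sentences.
        return [sent for para in self.paras() for sent in para]

    def words(self):
        # list of str: sentences flattened into words.
        return [word for sent in self.sents() for word in sent]


reader = ToyPlaintextReader("The dog barked .\nIt was loud .\n\nThe cat slept .")
print(reader.words()[:4])
print(len(reader.sents()), len(reader.paras()))
```

Each level of flattening removes one layer of list nesting, which is the relationship between paras(), sents(), and words() in the list above.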
Module | aligned |
No module docstring; 1/1 class documented |
Module | api |
API for corpus readers. |
Module | bnc |
Corpus reader for the XML version of the British National Corpus. |
Module | bracket |
Corpus reader for corpora that consist of parenthesis-delineated parse trees. |
Module | categorized |
CorpusReader structured for corpora that contain one instance on each row. This CorpusReader is specifically used for the Subjectivity Dataset and the Sentence Polarity Dataset. |
Module | chasen |
No module docstring; 0/2 function, 1/1 class documented |
Module | childes |
Corpus reader for the XML version of the CHILDES corpus. |
Module | chunked |
A reader for corpora that contain chunked (and optionally tagged) documents. |
Module | cmudict |
The Carnegie Mellon Pronouncing Dictionary [cmudict.0.6] ftp://ftp.cs.cmu.edu/project/speech/dict/ Copyright 1998 Carnegie Mellon University |
Module | comparative |
CorpusReader for the Comparative Sentence Dataset. |
Module | conll |
Read CoNLL-style chunk files. |
Module | crubadan |
An NLTK interface for the n-gram statistics gathered from the corpora for each language using An Crubadan. |
Module | dependency |
Undocumented |
Module | framenet |
Corpus reader for the FrameNet 1.7 lexicon and corpus. |
Module | ieer |
Corpus reader for the Information Extraction and Entity Recognition Corpus. |
Module | indian |
Indian Language POS-Tagged Corpus. Collected by A Kumaran, Microsoft Research, India. Distributed with permission. |
Module | ipipan |
Undocumented |
Module | knbc |
Undocumented |
Module | lin |
Undocumented |
Module | mte |
A reader for corpora whose documents are in MTE format. |
Module | nkjp |
No module docstring; 1/1 function, 4/5 classes documented |
Module | nombank |
No module docstring; 2/5 classes documented |
Module | nps |
Undocumented |
Module | opinion |
CorpusReader for the Opinion Lexicon. |
Module | panlex_lite |
CorpusReader for PanLex Lite, a stripped down version of PanLex distributed as an SQLite database. See the README.txt in the panlex_lite corpus directory for more information on PanLex Lite. |
Module | panlex_swadesh |
Undocumented |
Module | pl196x |
Undocumented |
Module | plaintext |
A reader for corpora that consist of plaintext documents. |
Module | ppattach |
Read lines from the Prepositional Phrase Attachment Corpus. |
Module | propbank |
No module docstring; 2/6 classes documented |
Module | pros |
CorpusReader for the Pros and Cons dataset. |
Module | reviews |
CorpusReader for reviews corpora (syntax based on Customer Review Corpus). |
Module | rte |
Corpus reader for the Recognizing Textual Entailment (RTE) Challenge Corpora. |
Module | semcor |
Corpus reader for the SemCor Corpus. |
Module | senseval |
Read from the Senseval 2 Corpus. |
Module | sentiwordnet |
An NLTK interface for SentiWordNet |
Module | sinica |
Sinica Treebank Corpus Sample |
Module | string |
Read tuples from a corpus consisting of categorized strings. For example, from the question classification corpus: |
Module | switchboard |
No module docstring; 1/1 class documented |
Module | tagged |
A reader for corpora whose documents contain part-of-speech-tagged words. |
Module | timit |
Read tokens, phonemes and audio data from the NLTK TIMIT Corpus. |
Module | toolbox |
Module for reading, writing and manipulating Toolbox databases and settings files. |
Module | twitter |
A reader for corpora that consist of Tweets. It is assumed that the Tweets have been serialised into line-delimited JSON. |
Module | udhr |
UDHR corpus reader. It mostly deals with encodings. |
Module | util |
No module docstring; 4/11 functions, 3/3 classes documented |
Module | verbnet |
An NLTK interface to the VerbNet verb lexicon |
Module | wordlist |
Undocumented |
Module | wordnet |
An NLTK interface for WordNet |
Module | xmldocs |
Corpus reader for corpora whose documents are xml files. |
Module | ycoe |
Corpus reader for the York-Toronto-Helsinki Parsed Corpus of Old English Prose (YCOE), a 1.5 million word syntactically-annotated corpus of Old English prose texts. The corpus is distributed by the Oxford Text Archive: ... |
From __init__.py:
Class |
|
Reader for corpora of word-aligned sentences. Tokens are assumed to be separated by whitespace. Sentences begin on separate lines. |
Class |
|
Reader for the Alpino Dutch Treebank. This corpus has a lexical breakdown structure embedded, as read by _parse. Unfortunately, this puts punctuation and some other words out of the sentence order in the xml element tree... |
Class |
|
Corpus reader for the XML version of the British National Corpus. |
Class |
|
Reader for corpora that consist of parenthesis-delineated parse trees, like those found in the "combined" section of the Penn Treebank, e.g. "(S (NP (DT the) (JJ little) (NN dog)) (VP (VBD barked)))". |
Class |
|
A reader for parsed corpora whose documents are divided into categories based on their file identifiers. @author: Nathan Schneider <nschneid@cs.cmu.edu> |
Class |
|
A mixin class used to aid in the implementation of corpus readers for categorized corpora. This class defines the method categories(), which returns a list of the categories for the corpus or for a specified set of fileids; and overrides ... |
Class |
|
A reader for plaintext corpora whose documents are divided into categories based on their file identifiers. |
Class |
|
A reader for corpora in which each row represents a single instance, mainly a sentence. Instances are divided into categories based on their file identifiers (see CategorizedCorpusReader). Since many corpora allow rows that contain more than one sentence, it is possible to specify a sentence tokenizer to retrieve all sentences instead of all rows. |
Class |
|
A reader for part-of-speech tagged corpora whose documents are divided into categories based on their file identifiers. |
Class |
|
Undocumented |
Class |
|
Corpus reader for the XML version of the CHILDES corpus. The CHILDES corpus is available at ``https://childes.talkbank.org/``. The XML version of CHILDES is located at ``https://childes.talkbank.org/data-xml/``... |
Class |
|
Reader for chunked (and optionally tagged) corpora. Paragraphs are split using a block reader. They are then tokenized into sentences using a sentence tokenizer. Finally, these sentences are parsed into chunk trees using a string-to-chunktree conversion function... |
Class |
|
No class docstring; 4/4 methods documented |
Class |
|
Reader for the Comparative Sentence Dataset by Jindal and Liu (2006). |
Class |
|
A ConllCorpusReader whose data file contains three columns: words, pos, and chunk. |
Class |
|
A corpus reader for CoNLL-style files. These files consist of a series of sentences, separated by blank lines. Each sentence is encoded using a table (or "grid") of values, where each line corresponds to a single word, and each column corresponds to an annotation type... |
Class |
|
A base class for "corpus reader" classes, each of which can be used to read a specific corpus format. Each individual corpus reader instance is used to read a specific corpus, consisting of one or more files under a common root directory... |
Class |
|
A corpus reader used to access language An Crubadan n-gram files. |
Class |
|
No class docstring; 1/7 method documented |
Class |
|
Reader for Europarl corpora that consist of plaintext documents. Documents are divided into chapters instead of paragraphs as for regular plaintext documents. Chapters are separated using blank lines. ... |
Class |
|
A corpus reader for the Framenet Corpus. |
Class |
|
No summary |
Class |
|
List of words, one per line. Blank lines are ignored. |
Class |
|
Corpus reader designed to work with corpus created by IPI PAN. See http://korpus.pl/en/ for more details about IPI PAN corpus. |
Class |
|
__init__, which specifies the location of the corpus and a method for detecting the sentence blocks in corpus files. |
Class |
|
Wrapper for the LISP-formatted thesauruses distributed by Dekang Lin. |
Class |
|
A corpus reader for the MAC_MORPHO corpus. Each line contains a single tagged word, using '_' as a separator. Sentence boundaries are based on the end-sentence tag ('_.'). Paragraph information is not included in the corpus, so each paragraph returned by ... |
Class |
|
Reader for corpora following the TEI-p5 xml scheme, such as MULTEXT-East. MULTEXT-East contains part-of-speech-tagged words with a quite precise tagging scheme. These tags can be converted to the Universal tagset... |
Class |
|
This class is used to read the list of word pairs from the subset of lexical pairs of The Paraphrase Database (PPDB) XXXL used in the Monolingual Word Alignment (MWA) algorithm described in Sultan et al... |
Class |
|
No class docstring; 0/1 instance variable, 0/4 constant, 9/10 methods documented |
Class |
|
Corpus reader for the nombank corpus, which augments the Penn Treebank with information about the predicate argument structure of every noun instance. The corpus consists of two parts: the predicate-argument annotations themselves, and a set of "frameset files" which define the argument labels used by the annotations, on a per-noun basis... |
Class |
|
This is a class to read the nonbreaking prefixes text files from the Moses Machine Translation toolkit. These lists are used in the Python port of the Moses word tokenizer. |
Class |
|
Undocumented |
Class |
|
Reader for Liu and Hu opinion lexicon. Blank lines and readme are ignored. |
Class |
|
No class docstring; 0/3 instance variable, 0/2 constant, 3/4 methods documented |
Class |
|
This is a class to read the PanLex Swadesh list from |
Class |
|
No class docstring; 0/3 instance variable, 0/1 class variable, 1/14 method documented |
Class |
|
Reader for corpora that consist of plaintext documents. Paragraphs are assumed to be split using blank lines. Sentences and words can be tokenized using the default tokenizers, or by custom tokenizers specified as parameters to the constructor. |
Class |
|
Undocumented |
Class |
|
sentence_id verb noun1 preposition noun2 attachment |
Class |
|
Corpus reader for the propbank corpus, which augments the Penn Treebank with information about the predicate argument structure of every verb instance. The corpus consists of two parts: the predicate-argument annotations themselves, and a set of "frameset files" which define the argument labels used by the annotations, on a per-verb basis... |
Class |
|
Reader for the Pros and Cons sentence dataset. |
Class |
|
Reader for the Customer Review Data dataset by Hu, Liu (2004). Note: we are not applying any sentence tokenization at the moment, just word tokenization. |
Class |
|
Corpus reader for corpora in RTE challenges. |
Class |
|
Corpus reader for the SemCor Corpus. For access to the complete XML data structure, use the ``xml()`` method. For access to simple word lists and tagged word lists, use ``words()``, ``sents()``, ``tagged_words()``, and ``tagged_sents()``. |
Class |
|
No class docstring; 1/3 method documented |
Class |
|
No class docstring; 0/4 instance variable, 1/6 method documented |
Class |
|
No class docstring; 0/1 instance variable, 1/5 method documented |
Class |
|
Reader for the sinica treebank. |
Class |
|
No class docstring; 0/1 instance variable, 2/4 methods documented |
Class |
|
No class docstring; 1/1 method documented |
Class |
|
Undocumented |
Class |
|
An abstract base class for reading corpora consisting of syntactically parsed text. Subclasses should define: |
Class |
|
Reader for simple part-of-speech tagged corpora. Paragraphs are assumed to be split using blank lines. Sentences and words can be tokenized using the default tokenizers, or by custom tokenizers specified as parameters to the constructor... |
Class |
|
Undocumented |
Class |
|
Reader for the TIMIT corpus (or any other corpus with the same file layout and use of file formats). The corpus root directory should contain the following files: |
Class |
|
A corpus reader for tagged sentences that are included in the TIMIT corpus. |
Class |
|
Undocumented |
Class |
|
Reader for corpora that consist of Tweets represented as a list of line-delimited JSON. |
Class |
|
Undocumented |
Class |
|
This class is used to read lists of characters from the Perl Unicode Properties (see http://perldoc.perl.org/perluniprops.html). The files in the perluniprop.zip are extracted using the Unicode::Tussle module from http://search... |
Class |
|
An NLTK interface to the VerbNet verb lexicon. |
Class |
|
List of words, one per line. Blank lines are ignored. |
Class |
|
A corpus reader used to access wordnet or its variants. |
Class |
|
A corpus reader for the WordNet information content corpus. |
Class |
|
Corpus reader for corpora whose documents are xml files. |
Class |
|
Corpus reader for the York-Toronto-Helsinki Parsed Corpus of Old English Prose (YCOE), a 1.5 million word syntactically-annotated corpus of Old English prose texts. |
Function | find |
Undocumented |
Function | tagged |
Undocumented |