nltk.corpus.reader.tagged.TaggedCorpusView

class documentation

class TaggedCorpusView(StreamBackedCorpusView):

Constructor: TaggedCorpusView(corpus_file, encoding, tagged, group_by_sent, ...)

A specialized corpus view for tagged documents. Flags control whether the tagged corpus documents are divided up by sentence or paragraph, and whether part-of-speech tags are included or omitted. TaggedCorpusView objects are typically created by TaggedCorpusReader (not directly by NLTK users).
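To make the effect of the three flags concrete, here is a minimal, standard-library-only sketch of how one paragraph could be shaped by `tagged`, `group_by_sent` and `group_by_para`. It is not NLTK's implementation: the helper names `split_tagged_token` and `shape_paragraph` are hypothetical, and the sketch assumes one sentence per line, whitespace-separated tokens, and `/` as the word/tag separator. It also ignores `tag_mapping_function` and any tag normalization the real reader may apply.

```python
def split_tagged_token(token, sep="/"):
    """Split 'word/TAG' at the last separator; the tag is None if sep is absent."""
    word, found, tag = token.rpartition(sep)
    if not found:
        return (token, None)
    return (word, tag)


def shape_paragraph(text, tagged, group_by_sent, group_by_para, sep="/"):
    """Return the tokens of one paragraph, shaped by the three flags.

    Assumption (not taken from NLTK): each non-empty line is one sentence
    and tokens are separated by whitespace.
    """
    paragraph = []
    for line in text.splitlines():
        if not line.strip():
            continue
        sentence = [split_tagged_token(tok, sep) for tok in line.split()]
        if not tagged:
            # Drop the tags and keep only the words.
            sentence = [word for word, _tag in sentence]
        if group_by_sent:
            paragraph.append(sentence)   # keep sentence boundaries
        else:
            paragraph.extend(sentence)   # flatten into a list of tokens
    # With group_by_para the paragraph itself becomes one element.
    return [paragraph] if group_by_para else paragraph


text = "The/DT cat/NN sat/VBD\nIt/PRP purred/VBD"
flat_tagged = shape_paragraph(text, tagged=True, group_by_sent=False, group_by_para=False)
sents_plain = shape_paragraph(text, tagged=False, group_by_sent=True, group_by_para=False)
paras_plain = shape_paragraph(text, tagged=False, group_by_sent=True, group_by_para=True)
```

Under these assumptions, `flat_tagged` is a flat list of `(word, tag)` pairs, `sents_plain` is a list of word lists, and `paras_plain` wraps those sentences in one more list per paragraph.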

Method	`__init__`	Create a new corpus view, based on the file `fileid`, and read with `block_reader`. See the class documentation for more information.
Method	`read_block`	Reads one paragraph at a time.
Instance Variable	`_group_by_para`	Undocumented
Instance Variable	`_group_by_sent`	Undocumented
Instance Variable	`_para_block_reader`	Undocumented
Instance Variable	`_sent_tokenizer`	Undocumented
Instance Variable	`_sep`	Undocumented
Instance Variable	`_tag_mapping_function`	Undocumented
Instance Variable	`_tagged`	Undocumented
Instance Variable	`_word_tokenizer`	Undocumented

Inherited from StreamBackedCorpusView:

Method	`__add__`	Undocumented
Method	`__enter__`	Undocumented
Method	`__exit__`	Undocumented
Method	`__getitem__`	Undocumented
Method	`__len__`	Undocumented
Method	`__mul__`	Undocumented
Method	`__radd__`	Undocumented
Method	`__rmul__`	Undocumented
Method	`close`	Close the file stream associated with this corpus view. This can be useful if you are worried about running out of file handles (although the stream should automatically be closed upon garbage collection of the corpus view)...
Method	`iterate_from`	Undocumented
Class Variable	`fileid`	Undocumented
Method	`_open`	Open the file stream associated with this corpus view. This will be performed if any value is read from the view while its file stream is closed.
Instance Variable	`_block_reader`	The function used to read a single block from the underlying file stream.
Instance Variable	`_cache`	A cache of the most recently read block. It is encoded as a tuple (start_toknum, end_toknum, tokens), where start_toknum is the token index of the first token in the block; end_toknum is the token index of the first token not in the block; and tokens is a list of the tokens in the block.
Instance Variable	`_current_blocknum`	This variable is set to the index of the next block that will be read, immediately before `self.read_block()` is called. This is provided for the benefit of the block reader, which under rare circumstances may need to know the current block number.
Instance Variable	`_current_toknum`	This variable is set to the index of the next token that will be read, immediately before `self.read_block()` is called. This is provided for the benefit of the block reader, which under rare circumstances may need to know the current token number.
Instance Variable	`_encoding`	Undocumented
Instance Variable	`_eofpos`	The character position of the last character in the file. This is calculated when the corpus view is initialized, and is used to decide when the end of file has been reached.
Instance Variable	`_fileid`	Undocumented
Instance Variable	`_filepos`	A list containing the file position of each block that has been processed. In particular, `_filepos[i]` is the file position of the first character in block `i`. Together with `_toknum`, this forms a partial mapping between token indices and file positions.
Instance Variable	`_len`	The total number of tokens in the corpus, if known; or None, if the number of tokens is not yet known.
Instance Variable	`_stream`	The stream used to access the underlying corpus file.
Instance Variable	`_toknum`	A list containing the token index of each block that has been processed. In particular, `_toknum[i]` is the token index of the first token in block `i`. Together with `_filepos`, this forms a partial mapping between token indices and file positions.
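
The partial mapping formed by `_toknum` and `_filepos` can be illustrated with a small sketch. It assumes two parallel, sorted lists like those described above and uses a binary search to find the last processed block that starts at or before a requested token index. The function name `locate_block` and the sample values are hypothetical, and this is an illustration of the idea rather than the library's own lookup code.

```python
import bisect


def locate_block(toknum, filepos, token_index):
    """Find the last processed block starting at or before token_index.

    toknum[i]  -- token index of the first token in block i
    filepos[i] -- file position of the first character in block i
    Returns (block_index, first_token_of_block, file_position).
    """
    if not toknum or len(toknum) != len(filepos):
        raise ValueError("toknum and filepos must be non-empty and the same length")
    if token_index < 0:
        raise ValueError("token_index must be non-negative")
    # bisect_right gives the insertion point after any equal entries, so
    # subtracting 1 yields the block whose first token is <= token_index.
    block_index = bisect.bisect_right(toknum, token_index) - 1
    return block_index, toknum[block_index], filepos[block_index]


# Hypothetical state after three blocks have been processed.
toknum = [0, 5, 12]
filepos = [0, 40, 97]

inside_second = locate_block(toknum, filepos, 7)
start_of_third = locate_block(toknum, filepos, 12)
beyond_known = locate_block(toknum, filepos, 30)
```

Because the mapping is only partial, a token index beyond the processed blocks (30 here) resolves to the last known block; a reader would then have to continue reading forward from that file position until it reaches the requested token.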

def __init__(self, corpus_file, encoding, tagged, group_by_sent, group_by_para, sep, word_tokenizer, sent_tokenizer, para_block_reader, tag_mapping_function=None):

overrides nltk.corpus.reader.util.StreamBackedCorpusView.__init__

Create a new corpus view, based on the file fileid, and read with block_reader. See the class documentation for more information.

Parameters
corpus_file	Undocumented
encoding	The unicode encoding that should be used to read the file's contents. If no encoding is specified, then the file's contents will be read as a non-unicode string (i.e., a str).
tagged	Undocumented
group_by_sent	Undocumented
group_by_para	Undocumented
sep	Undocumented
word_tokenizer	Undocumented
sent_tokenizer	Undocumented
para_block_reader	Undocumented
tag_mapping_function	Undocumented
fileid	The path to the file that is read by this corpus view. `fileid` can either be a string or a `PathPointer`.
startpos	The file position at which the view will start reading. This can be used to skip over preface sections.