class documentation

A stream-backed corpus view for corpus files that consist of sequences of serialized Python objects (serialized using pickle.dump). One use case for this class is storing the result of running feature detection on a corpus to disk. This is useful when feature detection is expensive (so we don't want to repeat it) but the corpus is too large to store in memory. The following example illustrates this technique:

>>> from nltk.corpus.reader.util import PickleCorpusView
>>> from nltk.util import LazyMap
>>> feature_corpus = LazyMap(detect_features, corpus) # doctest: +SKIP
>>> PickleCorpusView.write(feature_corpus, some_fileid)  # doctest: +SKIP
>>> pcv = PickleCorpusView(some_fileid) # doctest: +SKIP
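The underlying file format is simply a sequence of pickled objects written back-to-back. As a stdlib-only illustration of that format (independent of NLTK; the file contents and record values below are made up for the example), objects can be written with repeated pickle.dump calls and streamed back lazily with pickle.load:

```python
import os
import pickle
import tempfile

# Hypothetical records standing in for feature-detection output.
records = [{"word": "the", "tag": "DT"}, {"word": "dog", "tag": "NN"}]

# Write them back-to-back to a file, as PickleCorpusView's format does.
with tempfile.NamedTemporaryFile(delete=False) as f:
    for record in records:
        pickle.dump(record, f, protocol=-1)  # protocol=-1: highest available
    path = f.name

# Stream the objects back one at a time, never holding the whole
# corpus in memory; pickle.load raises EOFError at end of file.
loaded = []
with open(path, "rb") as f:
    while True:
        try:
            loaded.append(pickle.load(f))
        except EOFError:
            break

os.remove(path)
assert loaded == records
```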
Class Method cache_to_tempfile Write the given sequence to a temporary file as a pickle corpus, and then return a PickleCorpusView for that temporary corpus file.
Class Method write Undocumented
Method __del__ If delete_on_gc was set to true when this PickleCorpusView was created, then delete the corpus view's fileid. (This method is called whenever a PickleCorpusView is garbage-collected.)
Method __init__ Create a new corpus view that reads the pickle corpus fileid.
Method read_block Read a block from the input stream.
Constant BLOCK_SIZE Undocumented
Constant PROTOCOL Undocumented
Instance Variable _delete_on_gc Undocumented

Inherited from StreamBackedCorpusView:

Method __add__ Undocumented
Method __enter__ Undocumented
Method __exit__ Undocumented
Method __getitem__ Undocumented
Method __len__ Undocumented
Method __mul__ Undocumented
Method __radd__ Undocumented
Method __rmul__ Undocumented
Method close Close the file stream associated with this corpus view. This can be useful if you are worried about running out of file handles (although the stream should automatically be closed upon garbage collection of the corpus view)...
Method iterate_from Undocumented
Class Variable fileid Undocumented
Method _open Open the file stream associated with this corpus view. This is performed automatically if any value is read from the view while its file stream is closed.
Instance Variable _block_reader The function used to read a single block from the underlying file stream.
Instance Variable _cache A cache of the most recently read block. It is encoded as a tuple (start_toknum, end_toknum, tokens), where start_toknum is the token index of the first token in the block; end_toknum is the token index of the first token not in the block; and tokens is a list of the tokens in the block.
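The cache-tuple encoding described above can be pictured with a small stdlib-only sketch (illustrative, not NLTK source; the token values are invented): a token index hits the cache exactly when it falls in the half-open range [start_toknum, end_toknum).

```python
# Illustrative sketch of the _cache encoding: (start_toknum, end_toknum, tokens),
# where end_toknum is the index of the first token NOT in the cached block.
_cache = (100, 200, ["tok%d" % n for n in range(100, 200)])

def cached_token(i):
    """Return token i from the cache, or None on a cache miss
    (in which case the view would read the block from disk)."""
    start, end, tokens = _cache
    if start <= i < end:
        return tokens[i - start]
    return None

assert cached_token(150) == "tok150"
assert cached_token(200) is None  # end_toknum itself is outside the block
```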
Instance Variable _current_blocknum This variable is set to the index of the next block that will be read, immediately before self.read_block() is called. This is provided for the benefit of the block reader, which under rare circumstances may need to know the current block number.
Instance Variable _current_toknum This variable is set to the index of the next token that will be read, immediately before self.read_block() is called. This is provided for the benefit of the block reader, which under rare circumstances may need to know the current token number.
Instance Variable _encoding Undocumented
Instance Variable _eofpos The character position of the last character in the file. This is calculated when the corpus view is initialized, and is used to decide when the end of file has been reached.
Instance Variable _fileid Undocumented
Instance Variable _filepos A list containing the file position of each block that has been processed. In particular, _filepos[i] is the file position of the first character in block i. Together with _toknum, this forms a partial mapping between token indices and file positions.
Instance Variable _len The total number of tokens in the corpus, if known; or None, if the number of tokens is not yet known.
Instance Variable _stream The stream used to access the underlying corpus file.
Instance Variable _toknum A list containing the token index of each block that has been processed. In particular, _toknum[i] is the token index of the first token in block i. Together with _filepos, this forms a partial mapping between token indices and file positions.
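The partial mapping formed by _toknum and _filepos is what lets the view resume reading near an arbitrary token index instead of rescanning from the start of the file. The following stdlib-only sketch (illustrative, not NLTK source; the list values are invented) shows the lookup idea:

```python
import bisect

# Illustrative parallel lists: _toknum[i] is the token index of block i's
# first token; _filepos[i] is the file position of block i's first character.
_toknum = [0, 100, 200, 300]
_filepos = [0, 4096, 8210, 12001]

def seek_info(target_toknum):
    """Return (block_start_toknum, file_position) of the nearest
    processed block starting at or before target_toknum."""
    i = bisect.bisect_right(_toknum, target_toknum) - 1
    return _toknum[i], _filepos[i]

# To read token 250, seek to file position 8210 and skip 50 tokens.
assert seek_info(250) == (200, 8210)
assert seek_info(0) == (0, 0)
```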
@classmethod
def cache_to_tempfile(cls, sequence, delete_on_gc=True): (source)

Write the given sequence to a temporary file as a pickle corpus, and then return a PickleCorpusView for that temporary corpus file.

Parameters
sequence: The sequence of objects to write to the temporary file.
delete_on_gc: If true, then the temporary file will be deleted whenever this object gets garbage-collected.
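The general approach behind cache_to_tempfile can be sketched with the standard library alone (a hypothetical simplification, not NLTK source; cache_sequence and its behavior are invented for the example): pickle each item of the sequence to a temporary file, then hand back a path from which the items can be streamed lazily.

```python
import os
import pickle
import tempfile

def cache_sequence(sequence):
    """Pickle each item of `sequence` to a fresh temp file and
    return the file's path (caller is responsible for deletion)."""
    fd, path = tempfile.mkstemp(suffix=".pcv")
    with os.fdopen(fd, "wb") as out:
        for item in sequence:
            pickle.dump(item, out, protocol=-1)
    return path

path = cache_sequence(range(5))

# Stream the cached items back; pickle.load raises EOFError at end of file.
items = []
with open(path, "rb") as f:
    while True:
        try:
            items.append(pickle.load(f))
        except EOFError:
            break

os.remove(path)
assert items == [0, 1, 2, 3, 4]
```

The real class method additionally wraps the temporary file in a PickleCorpusView and can arrange for the file to be deleted on garbage collection.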
@classmethod
def write(cls, sequence, output_file): (source)

Undocumented

def __del__(self): (source)

If delete_on_gc was set to true when this PickleCorpusView was created, then delete the corpus view's fileid. (This method is called whenever a PickleCorpusView is garbage-collected.)

def __init__(self, fileid, delete_on_gc=False): (source)

Create a new corpus view that reads the pickle corpus fileid.

Parameters
fileid: The path of the pickle corpus file to read.
delete_on_gc: If true, then fileid will be deleted whenever this object gets garbage-collected.
def read_block(self, stream): (source)

Read a block from the input stream.

Parameters
stream (stream): an input stream
Returns
list(any): a block of tokens from the input stream
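Given that the file is a sequence of pickled objects and BLOCK_SIZE is 100 (per the constant on this page), a read_block of this kind can be sketched as follows; this is an illustrative stdlib-only version, not the NLTK source:

```python
import io
import pickle

BLOCK_SIZE = 100  # mirrors the class constant documented on this page

def read_block(stream):
    """Read up to BLOCK_SIZE pickled tokens from `stream`,
    stopping cleanly when the end of the stream is reached."""
    result = []
    for _ in range(BLOCK_SIZE):
        try:
            result.append(pickle.load(stream))
        except EOFError:
            break
    return result

# Build an in-memory "corpus file" of three pickled tokens and read it back.
buf = io.BytesIO()
for tok in ["a", "b", "c"]:
    pickle.dump(tok, buf, protocol=-1)
buf.seek(0)
assert read_block(buf) == ["a", "b", "c"]
```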
BLOCK_SIZE: int = (source)

Undocumented

Value
100
PROTOCOL: int = (source)

Undocumented

Value
-1
_delete_on_gc = (source)

Undocumented