class documentation

This CorpusView is used to skip the initial readme block of the corpus.
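As a rough illustration of what skipping a readme block can involve, the sketch below consumes the first blank-line-delimited block of a stream and records the position where the real content starts. It is a standalone approximation under stated assumptions, not this class's actual code: `skip_first_block` and the sample text are hypothetical, and an in-memory `io.StringIO` stands in for the corpus file.

```python
import io

def skip_first_block(stream):
    """Hypothetical helper: consume the first block of text, up to and
    including the blank line that ends it, and return the stream position."""
    seen_text = False
    while True:
        line = stream.readline()
        if not line:            # end of file reached
            break
        if line.strip():
            seen_text = True    # still inside the readme block
        elif seen_text:
            break               # blank line after text closes the block
    return stream.tell()

# Hypothetical corpus: a two-line readme, a blank line, then the entries.
readme = "README: license and notes\nmore notes\n\n"
corpus = io.StringIO(readme + "first entry\nsecond entry\n")

start = skip_first_block(corpus)      # position just past the readme block
entries = corpus.read().splitlines()  # everything after the readme
```

A view that remembers `start` as its initial file position would never hand the readme lines to its block reader.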

Method __init__ Create a new corpus view, based on the file fileid, and read with block_reader. See the class documentation for more information.
Instance Variable _filepos Undocumented

Inherited from StreamBackedCorpusView:

Method __add__ Undocumented
Method __enter__ Undocumented
Method __exit__ Undocumented
Method __getitem__ Undocumented
Method __len__ Undocumented
Method __mul__ Undocumented
Method __radd__ Undocumented
Method __rmul__ Undocumented
Method close Close the file stream associated with this corpus view. This can be useful if you are worried about running out of file handles (although the stream should automatically be closed upon garbage collection of the corpus view)...
Method iterate_from Undocumented
Method read_block Read a block from the input stream.
Class Variable fileid Undocumented
Method _open Open the file stream associated with this corpus view. This will be performed if any value is read from the view while its file stream is closed.
Instance Variable _block_reader The function used to read a single block from the underlying file stream.
Instance Variable _cache A cache of the most recently read block. It is encoded as a tuple (start_toknum, end_toknum, tokens), where start_toknum is the token index of the first token in the block; end_toknum is the token index of the first token not in the block; and tokens is a list of the tokens in the block.
Instance Variable _current_blocknum This variable is set to the index of the next block that will be read, immediately before self.read_block() is called. This is provided for the benefit of the block reader, which under rare circumstances may need to know the current block number.
Instance Variable _current_toknum This variable is set to the index of the next token that will be read, immediately before self.read_block() is called. This is provided for the benefit of the block reader, which under rare circumstances may need to know the current token number.
Instance Variable _encoding Undocumented
Instance Variable _eofpos The character position of the last character in the file. This is calculated when the corpus view is initialized, and is used to decide when the end of file has been reached.
Instance Variable _fileid Undocumented
Instance Variable _len The total number of tokens in the corpus, if known; or None, if the number of tokens is not yet known.
Instance Variable _stream The stream used to access the underlying corpus file.
Instance Variable _toknum A list containing the token index of each block that has been processed. In particular, _toknum[i] is the token index of the first token in block i. Together with _filepos, this forms a partial mapping between token indices and file positions.
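The relationship between _toknum, _filepos and _cache described above can be sketched with plain lists. This is only an illustration under assumed values: the numbers, the token strings, and the helper names `block_for_token` and `token_from_cache` are hypothetical and are not part of the class.

```python
import bisect

# Assumed state after three blocks have been processed (hypothetical values).
toknum = [0, 3, 7]     # toknum[i]: token index of the first token in block i
filepos = [0, 42, 97]  # filepos[i]: file position at which block i starts

# Cache in the documented (start_toknum, end_toknum, tokens) layout:
# it holds tokens 3, 4, 5 and 6; token 7 is the first one NOT in the block.
cache = (3, 7, ["a", "b", "c", "d"])

def block_for_token(toknum, filepos, i):
    """Return (block index, file position) of the known block containing token i."""
    block = bisect.bisect_right(toknum, i) - 1
    return block, filepos[block]

def token_from_cache(cache, i):
    """Return token i if the cached block covers it, otherwise None."""
    start, end, tokens = cache
    if start <= i < end:
        return tokens[i - start]
    return None

where = block_for_token(toknum, filepos, 5)  # token 5 lies in block 1
hit = token_from_cache(cache, 4)             # served from the cache
miss = token_from_cache(cache, 7)            # outside the cached block
```

The mapping is partial because it only covers blocks that have already been read; tokens past the last known block still require reading forward from the last recorded file position.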
def __init__(self, *args, **kwargs):

Create a new corpus view, based on the file fileid, and read with block_reader. See the class documentation for more information.

Parameters
*args Undocumented
fileid The path to the file that is read by this corpus view. fileid can either be a string or a PathPointer.
startpos The file position at which the view will start reading. This can be used to skip over preface sections.
encoding The unicode encoding that should be used to read the file's contents. If no encoding is specified, then the file's contents will be read as a non-unicode string (i.e., a str).
**kwargs Undocumented
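To make the roles of fileid, startpos and encoding concrete, here is a minimal standalone sketch. It does not call the library: `read_tokens` is a hypothetical helper, the file contents are invented, and it assumes that startpos is a byte offset into the file.

```python
import os
import tempfile

def read_tokens(fileid, startpos=0, encoding="utf-8"):
    """Hypothetical helper: read whitespace-separated tokens from the file
    at path `fileid`, starting at byte offset `startpos`."""
    with open(fileid, "rb") as stream:
        stream.seek(startpos)                 # skip the preface section
        return stream.read().decode(encoding).split()

preface = "Préface: not part of the corpus\n\n"
body = "alpha beta\ngamma\n"

# Write a small temporary file to stand in for a corpus file.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write((preface + body).encode("utf-8"))
    path = tmp.name

# Measure the preface in bytes, not characters: "é" takes two bytes in UTF-8.
startpos = len(preface.encode("utf-8"))
tokens = read_tokens(path, startpos=startpos, encoding="utf-8")
os.remove(path)
```

Measuring the offset on the encoded bytes matters whenever the preface contains non-ASCII characters; a character count would land the read in the wrong place.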