nltk.data.SeekableUnicodeStreamReader

class documentation

class SeekableUnicodeStreamReader: (source)

Constructor: SeekableUnicodeStreamReader(stream, encoding, errors)

A stream reader that automatically decodes the source byte stream into unicode (like codecs.StreamReader), but still supports the seek() and tell() operations correctly. This is in contrast to codecs.StreamReader, which provides broken seek() and tell() methods.
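
The problem with codecs.StreamReader can be seen in a short standard-library sketch. The sample bytes are made up for illustration. After a single line is read, tell() is answered by the underlying byte stream, which has already advanced past data that the reader is still buffering, so the reported position does not match the end of the returned line.

```python
import codecs
import io

# Hypothetical sample data: two short ASCII lines.
raw = b"line one\nline two\n"
reader = codecs.getreader("utf-8")(io.BytesIO(raw))

first = reader.readline()               # returns only the first line
reported = reader.tell()                # answered by the underlying byte stream
expected = len(first.encode("utf-8"))   # byte offset just after the first line

print(repr(first), "reported:", reported, "expected:", expected)
```

Seeking back to the reported position and reading again would skip the second line, which is the kind of mismatch a buffer-aware tell() has to avoid.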

This class was motivated by StreamBackedCorpusView, which makes extensive use of seek() and tell(), and needs to be able to handle unicode-encoded files.

Note: this class requires stateless decoders. To my knowledge, this shouldn't cause a problem with any of Python's builtin unicode encodings.

Method	`__del__`	Undocumented
Method	`__enter__`	Undocumented
Method	`__exit__`	Undocumented
Method	`__init__`	Undocumented
Method	`__iter__`	Return self
Method	`__next__`	Undocumented
Method	`char_seek_forward`	Move the read pointer forward by `offset` characters.
Method	`close`	Close the underlying stream.
Method	`discard_line`	Undocumented
Method	`next`	Return the next decoded line from the underlying stream.
Method	`read`	Read up to `size` bytes, decode them using this reader's encoding, and return the resulting unicode string.
Method	`readline`	Read a line of text, decode it using this reader's encoding, and return the resulting unicode string.
Method	`readlines`	Read this file's contents, decode them using this reader's encoding, and return them as a list of unicode lines.
Method	`seek`	Move the stream to a new file position. If the reader is maintaining any buffers, then they will be cleared.
Method	`tell`	Return the current file position on the underlying byte stream. If this reader is maintaining any buffers, then the returned file position will be the position of the beginning of those buffers.
Method	`xreadlines`	Return self
Constant	`DEBUG`	Undocumented
Instance Variable	`bytebuffer`	A buffer used to hold bytes that have been read but have not yet been decoded. This is only used when the final bytes from a read do not form a complete encoding for a character.
Instance Variable	`decode`	The function that is used to decode byte strings into unicode strings.
Instance Variable	`encoding`	The name of the encoding that should be used to decode the underlying stream.
Instance Variable	`errors`	The error mode that should be used when decoding data from the underlying stream. Can be 'strict', 'ignore', or 'replace'.
Instance Variable	`linebuffer`	A buffer used by `readline()` to hold characters that have been read, but have not yet been returned by `read()` or `readline()`. This buffer consists of a list of unicode strings, where each string corresponds to a single line...
Instance Variable	`stream`	The underlying stream.
Property	`closed`	True if the underlying stream is closed.
Property	`mode`	The mode of the underlying stream.
Property	`name`	The name of the underlying stream.
Method	`_char_seek_forward`	Move the file position forward by `offset` characters, ignoring all buffers.
Method	`_check_bom`	Undocumented
Method	`_incr_decode`	Decode the given byte string into a unicode string, using this reader's encoding. If an exception is encountered that appears to be caused by a truncation error, then just decode the byte string without the bytes that cause the truncation error.
Method	`_read`	Read up to `size` bytes from the underlying stream, decode them using this reader's encoding, and return the resulting unicode string. `linebuffer` is not included in the result.
Constant	`_BOM_TABLE`	Undocumented
Instance Variable	`_bom`	The length of the byte order marker at the beginning of the stream (or None for no byte order marker).
Instance Variable	`_rewind_checkpoint`	The file position at which the most recent read on the underlying stream began. This is used, together with `_rewind_numchars`, to backtrack to the beginning of `linebuffer` (which is required by `tell()`).
Instance Variable	`_rewind_numchars`	The number of characters that have been returned since the read that started at `_rewind_checkpoint`. This is used, together with `_rewind_checkpoint`, to backtrack to the beginning of `linebuffer` (which is required by ...

def __del__(self): (source) ¶

Undocumented

def __enter__(self): (source) ¶

Undocumented

def __exit__(self, type, value, traceback): (source) ¶

Undocumented

@py3_data
def __init__(self, stream, encoding, errors='strict'): (source) ¶

Undocumented

def __iter__(self): (source) ¶

Return self

def __next__(self): (source) ¶

Undocumented

def char_seek_forward(self, offset): (source) ¶

Move the read pointer forward by offset characters.

def close(self): (source) ¶

Close the underlying stream.

def discard_line(self): (source) ¶

Undocumented

def next(self): (source) ¶

Return the next decoded line from the underlying stream.

def read(self, size=None): (source) ¶

Read up to size bytes, decode them using this reader's encoding, and return the resulting unicode string.

Parameters
size:int	The maximum number of bytes to read. If not specified, then read as many bytes as possible.
Returns
unicode	Undocumented

def readline(self, size=None): (source) ¶

Read a line of text, decode it using this reader's encoding, and return the resulting unicode string.

Parameters
size:int	The maximum number of bytes to read. If no newline is encountered before `size` bytes have been read, then the returned value may not be a complete line of text.

def readlines(self, sizehint=None, keepends=True): (source) ¶

Read this file's contents, decode them using this reader's encoding, and return them as a list of unicode lines.

Parameters
sizehint	Ignored.
keepends	If false, then strip newlines.
Returns
list(unicode)	Undocumented
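
The effect of keepends can be sketched with str.splitlines(), on the assumption that the reader splits decoded text the same way. That assumption is not stated in this page, and the sample text is invented.

```python
# Hypothetical decoded file contents.
text = "alpha\nbeta\ngamma\n"

with_ends = text.splitlines(keepends=True)      # newlines kept on each line
without_ends = text.splitlines(keepends=False)  # newlines stripped

print(with_ends)
print(without_ends)
```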

def seek(self, offset, whence=0): (source) ¶

Move the stream to a new file position. If the reader is maintaining any buffers, then they will be cleared.

Parameters
offset	A byte count offset.
whence	If 0, then the offset is from the start of the file (offset should be positive); if 1, then the offset is from the current position (offset may be positive or negative); and if 2, then the offset is from the end of the file (offset should typically be negative).
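
The three whence values follow the usual file-seek convention. The sketch below illustrates them on a plain io.BytesIO byte stream rather than on the reader itself, so it shows only the meaning of the offsets. Whether the reader accepts every whence value is not demonstrated here.

```python
import io

stream = io.BytesIO(b"0123456789")  # hypothetical 10-byte stream

stream.seek(4, 0)                   # whence=0: 4 bytes from the start
from_start = stream.tell()

stream.seek(-2, 1)                  # whence=1: 2 bytes back from the current position
from_current = stream.tell()

stream.seek(-3, 2)                  # whence=2: 3 bytes before the end
from_end = stream.tell()

print(from_start, from_current, from_end)
```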

def tell(self): (source) ¶

Return the current file position on the underlying byte stream. If this reader is maintaining any buffers, then the returned file position will be the position of the beginning of those buffers.

def xreadlines(self): (source) ¶

Return self

DEBUG: bool = (source) ¶

Undocumented

Value

True

bytebuffer = (source) ¶

A buffer used to hold bytes that have been read but have not yet been decoded. This is only used when the final bytes from a read do not form a complete encoding for a character.

decode = (source) ¶

The function that is used to decode byte strings into unicode strings.

encoding = (source) ¶

The name of the encoding that should be used to decode the underlying stream.

errors = (source) ¶

The error mode that should be used when decoding data from the underlying stream. Can be 'strict', 'ignore', or 'replace'.
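
The three error modes are the standard Python codec error handlers. A minimal sketch of how they differ, using bytes.decode() directly on an invented byte string that is not valid UTF-8:

```python
data = b"caf\xe9"  # Latin-1 bytes; the final byte is not valid UTF-8

ignored = data.decode("utf-8", errors="ignore")    # drops the undecodable byte
replaced = data.decode("utf-8", errors="replace")  # substitutes U+FFFD

try:
    data.decode("utf-8", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True                           # 'strict' raises instead

print(repr(ignored), repr(replaced), strict_failed)
```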

linebuffer = (source) ¶

A buffer used by readline() to hold characters that have been read, but have not yet been returned by read() or readline(). This buffer consists of a list of unicode strings, where each string corresponds to a single line. The final element of the list may or may not be a complete line. Note that the existence of a linebuffer makes the tell() operation more complex, because it must backtrack to the beginning of the buffer to determine the correct file position in the underlying byte stream.

stream = (source) ¶

The underlying stream.

@property
closed = (source) ¶

True if the underlying stream is closed.

@property
mode = (source) ¶

The mode of the underlying stream.

@property
name = (source) ¶

The name of the underlying stream.

def _char_seek_forward(self, offset, est_bytes=None): (source) ¶

Move the file position forward by offset characters, ignoring all buffers.

Parameters
offset	Undocumented
est_bytes	A hint, giving an estimate of the number of bytes that will be needed to move forward by `offset` chars. Defaults to `offset`.
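
Moving forward by characters rather than bytes requires decoding, because a character may occupy more than one byte. The sketch below is a simplified stand-in, not this method's implementation: the helper name skip_chars is hypothetical, and instead of using an est_bytes estimate it reads one byte at a time through an incremental decoder.

```python
import codecs
import io

def skip_chars(stream, offset, encoding="utf-8"):
    """Advance a byte stream past `offset` decoded characters (hypothetical helper)."""
    decoder = codecs.getincrementaldecoder(encoding)()
    remaining = offset
    while remaining > 0:
        byte = stream.read(1)
        if not byte:                    # end of stream reached early
            break
        # decode() returns "" until a full character has been fed in
        remaining -= len(decoder.decode(byte))
    return stream.tell()

stream = io.BytesIO("héllo".encode("utf-8"))
position = skip_chars(stream, 2)        # skip "h" (1 byte) and "é" (2 bytes)
rest = stream.read().decode("utf-8")

print(position, repr(rest))
```

Reading byte by byte is slow. A byte estimate such as est_bytes lets an implementation read a larger chunk at once and then correct for multi-byte characters.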

def _check_bom(self): (source) ¶

Undocumented

def _incr_decode(self, bytes): (source) ¶

Decode the given byte string into a unicode string, using this reader's encoding. If an exception is encountered that appears to be caused by a truncation error, then just decode the byte string without the bytes that cause the truncation error.

Return a tuple (chars, num_consumed), where chars is the decoded unicode string, and num_consumed is the number of bytes that were consumed.
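
The truncation handling described above can be sketched with the standard codecs module. This is an illustration under stated assumptions, not the method's source: the function name incr_decode is hypothetical, and a decoding error is treated as a truncation whenever it extends to the end of the input.

```python
import codecs

def incr_decode(data, encoding="utf-8", errors="strict"):
    """Return (chars, num_consumed), tolerating a cut-off final character."""
    decode = codecs.getdecoder(encoding)
    try:
        return decode(data, "strict")
    except UnicodeDecodeError as exc:
        if exc.end == len(data):
            # The error reaches the end of the input: assume the last
            # character was truncated and decode only the bytes before it.
            return decode(data[:exc.start], errors)
        if errors == "strict":
            raise
        return decode(data, errors)

full = "héllo".encode("utf-8")                 # b'h\xc3\xa9llo'
chars, num_consumed = incr_decode(full[:2])    # cut in the middle of "é"

print(repr(chars), num_consumed)
```

The unconsumed tail (here the single byte b'\xc3') is the kind of data that bytebuffer would hold until the next read supplies the rest of the character.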

def _read(self, size=None): (source) ¶

Read up to size bytes from the underlying stream, decode them using this reader's encoding, and return the resulting unicode string. linebuffer is not included in the result.

_BOM_TABLE = (source) ¶

Undocumented

Value

{'utf8': [(codecs.BOM_UTF8, None)],
 'utf16': [(codecs.BOM_UTF16_LE, 'utf16-le'), (codecs.BOM_UTF16_BE, 'utf16-be')],
 'utf16le': [(codecs.BOM_UTF16_LE, None)],
 'utf16be': [(codecs.BOM_UTF16_BE, None)],
 'utf32': [(codecs.BOM_UTF32_LE, 'utf32-le'), (codecs.BOM_UTF32_BE, 'utf32-be')],
...

_bom = (source) ¶

The length of the byte order marker at the beginning of the stream (or None for no byte order marker).
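
A byte order marker check of this kind can be sketched as a table lookup followed by a prefix test. Everything below is illustrative: the function check_bom, the two-entry BOM_TABLE (modelled on the visible part of _BOM_TABLE), and the name normalisation are assumptions, not the class's code.

```python
import codecs

# Hypothetical two-entry table, modelled on the visible part of _BOM_TABLE.
# Each entry pairs a BOM with an optional replacement encoding name.
BOM_TABLE = {
    "utf8": [(codecs.BOM_UTF8, None)],
    "utf16": [(codecs.BOM_UTF16_LE, "utf16-le"), (codecs.BOM_UTF16_BE, "utf16-be")],
}

def check_bom(data, encoding):
    """Return (bom_length, encoding_name); bom_length is None if no BOM matches."""
    key = encoding.lower().replace("-", "").replace("_", "")
    for bom, new_encoding in BOM_TABLE.get(key, []):
        if data.startswith(bom):
            return len(bom), new_encoding or encoding
    return None, encoding

utf8_result = check_bom(codecs.BOM_UTF8 + b"hi", "utf-8")
utf16_result = check_bom(codecs.BOM_UTF16_BE + b"\x00h", "utf-16")
plain_result = check_bom(b"hi", "utf-8")

print(utf8_result, utf16_result, plain_result)
```

Knowing the BOM length lets a reader skip the marker when seeking to the start of the file, so that it is not returned as part of the text.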

_rewind_checkpoint = (source) ¶

The file position at which the most recent read on the underlying stream began. This is used, together with _rewind_numchars, to backtrack to the beginning of linebuffer (which is required by tell()).

_rewind_numchars = (source) ¶

The number of characters that have been returned since the read that started at _rewind_checkpoint. This is used, together with _rewind_checkpoint, to backtrack to the beginning of linebuffer (which is required by tell()).