class documentation

A stream reader that automatically decodes the source byte stream into unicode (like codecs.StreamReader), but still supports the seek() and tell() operations correctly. This is in contrast to codecs.StreamReader, which provides broken seek() and tell() methods.

This class was motivated by StreamBackedCorpusView, which makes extensive use of seek() and tell(), and needs to be able to handle unicode-encoded files.

Note: this class requires stateless decoders. To my knowledge, this shouldn't cause a problem with any of Python's built-in unicode encodings.
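As a rough illustration of what a stateless decoder means here (one reading of the note above, not taken from the class source): a chunk of bytes that starts and ends on character boundaries can be decoded on its own, with no memory of earlier input. The sample text and the variable names below are made up for the sketch.

```python
import codecs

text = "naïve café\nsecond line\n"
data = text.encode("utf-8")

# Split the bytes at a character boundary (just after the first newline).
split = data.index(b"\n") + 1
first, second = data[:split], data[split:]

# codecs.getdecoder returns a plain function: bytes -> (text, bytes_consumed).
decode = codecs.getdecoder("utf-8")
chars_a, used_a = decode(first)
chars_b, used_b = decode(second)

# Decoding the chunks independently matches decoding the whole byte string.
joined = chars_a + chars_b
print(joined == text, used_a + used_b == len(data))
```

Because each call is independent, a reader can seek to any character boundary and resume decoding from there.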

Method __del__ Undocumented
Method __enter__ Undocumented
Method __exit__ Undocumented
Method __init__ Undocumented
Method __iter__ Return self
Method __next__ Undocumented
Method char_seek_forward Move the read pointer forward by offset characters.
Method close Close the underlying stream.
Method discard_line Undocumented
Method next Return the next decoded line from the underlying stream.
Method read Read up to size bytes, decode them using this reader's encoding, and return the resulting unicode string.
Method readline Read a line of text, decode it using this reader's encoding, and return the resulting unicode string.
Method readlines Read this file's contents, decode them using this reader's encoding, and return them as a list of unicode lines.
Method seek Move the stream to a new file position. If the reader is maintaining any buffers, then they will be cleared.
Method tell Return the current file position on the underlying byte stream. If this reader is maintaining any buffers, then the returned file position will be the position of the beginning of those buffers.
Method xreadlines Return self
Constant DEBUG Undocumented
Instance Variable bytebuffer A buffer to hold bytes that have been read but have not yet been decoded. This is only used when the final bytes from a read do not form a complete encoding for a character.
Instance Variable decode The function that is used to decode byte strings into unicode strings.
Instance Variable encoding The name of the encoding that should be used to decode the underlying stream.
Instance Variable errors The error mode that should be used when decoding data from the underlying stream. Can be 'strict', 'ignore', or 'replace'.
Instance Variable linebuffer A buffer used by readline() to hold characters that have been read, but have not yet been returned by read() or readline(). This buffer consists of a list of unicode strings, where each string corresponds to a single line...
Instance Variable stream The underlying stream.
Property closed True if the underlying stream is closed.
Property mode The mode of the underlying stream.
Property name The name of the underlying stream.
Method _char_seek_forward Move the file position forward by offset characters, ignoring all buffers.
Method _check_bom Undocumented
Method _incr_decode Decode the given byte string into a unicode string, using this reader's encoding. If an exception is encountered that appears to be caused by a truncation error, then just decode the byte string without the bytes that cause the truncation error.
Method _read Read up to size bytes from the underlying stream, decode them using this reader's encoding, and return the resulting unicode string. linebuffer is not included in the result.
Constant _BOM_TABLE Undocumented
Instance Variable _bom The length of the byte order marker at the beginning of the stream (or None for no byte order marker).
Instance Variable _rewind_checkpoint The file position at which the most recent read on the underlying stream began. This is used, together with _rewind_numchars, to backtrack to the beginning of linebuffer (which is required by tell()).
Instance Variable _rewind_numchars The number of characters that have been returned since the read that started at _rewind_checkpoint. This is used, together with _rewind_checkpoint, to backtrack to the beginning of linebuffer (which is required by ...
def __del__(self): (source)

Undocumented

def __enter__(self): (source)

Undocumented

def __exit__(self, type, value, traceback): (source)

Undocumented

@py3_data
def __init__(self, stream, encoding, errors='strict'): (source)

Undocumented

def __iter__(self): (source)

Return self

def __next__(self): (source)

Undocumented

def char_seek_forward(self, offset): (source)

Move the read pointer forward by offset characters.

def close(self): (source)

Close the underlying stream.

def discard_line(self): (source)

Undocumented

def next(self): (source)

Return the next decoded line from the underlying stream.

def read(self, size=None): (source)

Read up to size bytes, decode them using this reader's encoding, and return the resulting unicode string.

Parameters
size (int): The maximum number of bytes to read. If not specified, then read as many bytes as possible.
Returns
unicode: Undocumented
def readline(self, size=None): (source)

Read a line of text, decode it using this reader's encoding, and return the resulting unicode string.

Parameters
size (int): The maximum number of bytes to read. If no newline is encountered before size bytes have been read, then the returned value may not be a complete line of text.
def readlines(self, sizehint=None, keepends=True): (source)

Read this file's contents, decode them using this reader's encoding, and return them as a list of unicode lines.

Parameters
sizehint: Ignored.
keepends: If false, then strip newlines.
Returns
list(unicode): Undocumented
def seek(self, offset, whence=0): (source)

Move the stream to a new file position. If the reader is maintaining any buffers, then they will be cleared.

Parameters
offset: A byte count offset.
whence: If 0, the offset is from the start of the file (offset should be positive); if 1, the offset is from the current position (offset may be positive or negative); if 2, the offset is from the end of the file (offset should typically be negative).
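Since seek() takes a byte count rather than a character count, the usual pattern is to record byte positions while reading and seek back to them later. The sketch below shows that pattern on a plain io.BytesIO with manual decoding; it does not use this class, the sample data is invented, and splitting on b"\n" is only safe for ASCII-compatible encodings such as UTF-8.

```python
import io

data = "α\nβγ\nδ\n".encode("utf-8")
stream = io.BytesIO(data)

offsets = []  # byte position at which each line starts
lines = []    # decoded text of each line
while True:
    pos = stream.tell()
    raw = stream.readline()
    if not raw:
        break
    offsets.append(pos)
    lines.append(raw.decode("utf-8"))

# Jump back to the start of the second line using its byte offset.
stream.seek(offsets[1])
reread = stream.readline().decode("utf-8")
print(offsets, repr(reread))
```

The second line starts at byte 3 even though it is only the third character of the text, which is why byte offsets and character offsets cannot be used interchangeably.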
def tell(self): (source)

Return the current file position on the underlying byte stream. If this reader is maintaining any buffers, then the returned file position will be the position of the beginning of those buffers.

def xreadlines(self): (source)

Return self

DEBUG: bool = (source)

Undocumented

Value
True
bytebuffer = (source)

A buffer to hold bytes that have been read but have not yet been decoded. This is only used when the final bytes from a read do not form a complete encoding for a character.

decode = (source)

The function that is used to decode byte strings into unicode strings.

encoding = (source)

The name of the encoding that should be used to decode the underlying stream.

errors = (source)

The error mode that should be used when decoding data from the underlying stream. Can be 'strict', 'ignore', or 'replace'.

linebuffer = (source)

A buffer used by readline() to hold characters that have been read, but have not yet been returned by read() or readline(). This buffer consists of a list of unicode strings, where each string corresponds to a single line. The final element of the list may or may not be a complete line. Note that the existence of a linebuffer makes the tell() operation more complex, because it must backtrack to the beginning of the buffer to determine the correct file position in the underlying byte stream.
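The shape of such a buffer can be sketched as follows. This is an illustrative helper under stated assumptions, not the class's own code: extend_linebuffer is a hypothetical name, and only "\n" is treated as a line terminator.

```python
def extend_linebuffer(linebuffer, new_chars):
    """Append newly decoded text to a list of lines (hypothetical helper).

    The last element of the list may be an incomplete line; if so, the
    new text is joined onto it before splitting again.
    """
    pending = ""
    if linebuffer and not linebuffer[-1].endswith("\n"):
        pending = linebuffer.pop()
    parts = (pending + new_chars).split("\n")
    # Every part except the last was followed by a newline.
    lines = [part + "\n" for part in parts[:-1]]
    if parts[-1]:
        lines.append(parts[-1])  # trailing incomplete line
    linebuffer.extend(lines)
    return linebuffer


buf = extend_linebuffer([], "ab\ncd")
first_state = list(buf)
buf = extend_linebuffer(buf, "ef\ngh\n")
print(first_state, buf)
```

After the first call the final element ("cd") is an incomplete line; the second call completes it.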

stream = (source)

The underlying stream.

@property
closed = (source)

True if the underlying stream is closed.

@property
mode = (source)

The mode of the underlying stream.

@property
name = (source)

The name of the underlying stream.

def _char_seek_forward(self, offset, est_bytes=None): (source)

Move the file position forward by offset characters, ignoring all buffers.

Parameters
offset: Undocumented
est_bytes: A hint, giving an estimate of the number of bytes that will be needed to move forward by offset chars. Defaults to offset.
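One way to picture moving forward by a number of characters on a byte stream is sketched below. Note that it swaps in a different technique from the one documented: instead of reading an est_bytes-sized block and adjusting, it feeds single bytes to a standard-library incremental decoder and counts the characters produced. The function name and sample data are assumptions made for the example.

```python
import codecs
import io


def char_seek_forward(stream, offset, encoding="utf-8"):
    """Advance a binary stream by `offset` characters (illustrative only).

    Returns the resulting byte position. Stops early at end of file.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    seen = 0
    while seen < offset:
        byte = stream.read(1)
        if not byte:
            break  # end of stream before `offset` characters were found
        # The decoder returns "" until a full character has been fed in.
        seen += len(decoder.decode(byte))
    return stream.tell()


stream = io.BytesIO("héllo".encode("utf-8"))
pos = char_seek_forward(stream, 2)
rest = stream.read().decode("utf-8")
print(pos, rest)
```

Reading one byte at a time is slow, which is presumably why a byte-count estimate is useful in practice; the sketch only aims to show the character-to-byte relationship.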
def _check_bom(self): (source)

Undocumented

def _incr_decode(self, bytes): (source)

Decode the given byte string into a unicode string, using this reader's encoding. If an exception is encountered that appears to be caused by a truncation error, then just decode the byte string without the bytes that cause the truncation error.

Return a tuple (chars, num_consumed), where chars is the decoded unicode string, and num_consumed is the number of bytes that were consumed.
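The behaviour described above might look roughly like the following. This is a minimal sketch, not the method's actual source: incr_decode is a standalone stand-in, and the rule "a decode error that reaches the end of the input is a truncation" is an assumption chosen for the example.

```python
def incr_decode(data, encoding="utf-8", errors="strict"):
    """Decode `data`, tolerating a character cut off at the end.

    Returns (chars, num_consumed), where num_consumed is the number of
    bytes that were actually decoded.
    """
    try:
        return data.decode(encoding, "strict"), len(data)
    except UnicodeDecodeError as exc:
        if exc.end == len(data):
            # The failure runs to the end of the input: assume the last
            # character was truncated and decode only what precedes it.
            head = data[: exc.start]
            return head.decode(encoding, errors), len(head)
        if errors == "strict":
            raise
        # A genuine error in the middle: apply the requested error mode.
        return data.decode(encoding, errors), len(data)


truncated = "café".encode("utf-8")[:-1]  # drops the second byte of "é"
chars, num_consumed = incr_decode(truncated)
print(repr(chars), num_consumed)
```

The unconsumed tail (here one byte) is the kind of data that bytebuffer is described as holding until the next read.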

def _read(self, size=None): (source)

Read up to size bytes from the underlying stream, decode them using this reader's encoding, and return the resulting unicode string. linebuffer is not included in the result.

_BOM_TABLE = (source)

Undocumented

Value
{'utf8': [(codecs.BOM_UTF8, None)],
 'utf16': [(codecs.BOM_UTF16_LE, 'utf16-le'), (codecs.BOM_UTF16_BE, 'utf16-be')],
 'utf16le': [(codecs.BOM_UTF16_LE, None)],
 'utf16be': [(codecs.BOM_UTF16_BE, None)],
 'utf32': [(codecs.BOM_UTF32_LE, 'utf32-le'), (codecs.BOM_UTF32_BE, 'utf32-be')],
...
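A table of this shape could be consulted along the following lines. The sketch copies only two of the visible entries, leaves the elided ones out, and uses a hypothetical check_bom function; how the real _check_bom normalises encoding names or applies the result is not documented here, so those details are assumptions.

```python
import codecs

# Partial copy of the table above, for illustration only.
BOM_TABLE = {
    "utf8": [(codecs.BOM_UTF8, None)],
    "utf16": [
        (codecs.BOM_UTF16_LE, "utf16-le"),
        (codecs.BOM_UTF16_BE, "utf16-be"),
    ],
}


def check_bom(head, encoding):
    """Return (bom_length, encoding_to_use) for the first bytes of a stream.

    bom_length is None when no byte order marker is found. A table entry
    whose second item is None keeps the original encoding name.
    """
    for bom, new_encoding in BOM_TABLE.get(encoding.lower(), []):
        if head.startswith(bom):
            return len(bom), new_encoding or encoding
    return None, encoding


print(check_bom(codecs.BOM_UTF8 + b"hi", "utf8"))
print(check_bom(codecs.BOM_UTF16_BE + b"\x00h", "utf16"))
print(check_bom(b"hi", "utf8"))
```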

_bom = (source)

The length of the byte order marker at the beginning of the stream (or None for no byte order marker).

_rewind_checkpoint = (source)

The file position at which the most recent read on the underlying stream began. This is used, together with _rewind_numchars, to backtrack to the beginning of linebuffer (which is required by tell()).

_rewind_numchars = (source)

The number of characters that have been returned since the read that started at _rewind_checkpoint. This is used, together with _rewind_checkpoint, to backtrack to the beginning of linebuffer (which is required by tell()).
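The checkpoint-plus-character-count idea can be illustrated with a small calculation. This is a sketch under assumptions, not the class's implementation: position_after is a hypothetical helper, the bytes from the checkpoint onward are assumed to decode cleanly, and re-encoding the consumed characters is assumed to reproduce the original bytes (true for valid UTF-8).

```python
def position_after(data, checkpoint, numchars, encoding="utf-8"):
    """Byte position reached after consuming `numchars` characters.

    `checkpoint` is the byte offset where decoding started.
    """
    decoded = data[checkpoint:].decode(encoding)
    consumed = decoded[:numchars]
    # The consumed characters occupy this many bytes in the stream.
    return checkpoint + len(consumed.encode(encoding))


data = "ab\nçd\nef\n".encode("utf-8")
# A read began at byte 3 (start of "çd\n"); three characters were returned.
pos = position_after(data, checkpoint=3, numchars=3)
print(pos, data[pos:])
```

Here the three returned characters span four bytes, so the position reported to the caller would be byte 7 rather than byte 6.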