class SeekableUnicodeStreamReader: (source)
Constructor: SeekableUnicodeStreamReader(stream, encoding, errors)
A stream reader that automatically decodes the source byte stream into unicode (like codecs.StreamReader), but still supports the seek() and tell() operations correctly. This is in contrast to codecs.StreamReader, which provides broken seek() and tell() methods.
This class was motivated by StreamBackedCorpusView, which makes extensive use of seek() and tell(), and needs to be able to handle unicode-encoded files.
Note: this class requires stateless decoders. To my knowledge, this shouldn't cause a problem with any of Python's builtin unicode encodings.
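The seek()/tell() problem described above can be reproduced with the standard library alone. The following is a minimal sketch, assuming a small in-memory UTF-8 stream (the sample text and variable names are illustrative only): codecs.StreamReader reads ahead in chunks, so the position reported after reading one line is usually further along than the end of that line.

```python
import codecs
import io

# Two short lines in an in-memory byte stream (illustrative data).
data = "line one\nline two\n".encode("utf-8")
raw = io.BytesIO(data)
reader = codecs.getreader("utf-8")(raw)

first_line = reader.readline()

# StreamReader has no tell() of its own; the call is forwarded to the
# underlying stream, which has already been read past the first line.
reported_pos = reader.tell()
true_pos = len(first_line.encode("utf-8"))

print(repr(first_line), reported_pos, true_pos)
```

Because the reported position does not match the end of the line that was returned, seeking back to it later would not resume at the second line. This is the mismatch the class is meant to avoid.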
Method | __del__ |
Undocumented |
Method | __enter__ |
Undocumented |
Method | __exit__ |
Undocumented |
Method | __init__ |
Undocumented |
Method | __iter__ |
Return self |
Method | __next__ |
Undocumented |
Method | char |
Move the read pointer forward by offset characters. |
Method | close |
Close the underlying stream. |
Method | discard |
Undocumented |
Method | next |
Return the next decoded line from the underlying stream. |
Method | read |
Read up to size bytes, decode them using this reader's encoding, and return the resulting unicode string. |
Method | readline |
Read a line of text, decode it using this reader's encoding, and return the resulting unicode string. |
Method | readlines |
Read this file's contents, decode them using this reader's encoding, and return it as a list of unicode lines. |
Method | seek |
Move the stream to a new file position. If the reader is maintaining any buffers, then they will be cleared. |
Method | tell |
Return the current file position on the underlying byte stream. If this reader is maintaining any buffers, then the returned file position will be the position of the beginning of those buffers. |
Method | xreadlines |
Return self |
Constant | DEBUG |
Undocumented |
Instance Variable | bytebuffer |
A buffer to hold bytes that have been read but have not yet been decoded. This is only used when the final bytes from a read do not form a complete encoding for a character. |
Instance Variable | decode |
The function that is used to decode byte strings into unicode strings. |
Instance Variable | encoding |
The name of the encoding that should be used to decode the underlying stream. |
Instance Variable | errors |
The error mode that should be used when decoding data from the underlying stream. Can be 'strict', 'ignore', or 'replace'. |
Instance Variable | linebuffer |
A buffer used by readline() to hold characters that have been read, but have not yet been returned by read() or readline(). This buffer consists of a list of unicode strings, where each string corresponds to a single line... |
Instance Variable | stream |
The underlying stream. |
Property | closed |
True if the underlying stream is closed. |
Property | mode |
The mode of the underlying stream. |
Property | name |
The name of the underlying stream. |
Method | _char |
Move the file position forward by offset characters, ignoring all buffers. |
Method | _check |
Undocumented |
Method | _incr |
Decode the given byte string into a unicode string, using this reader's encoding. If an exception is encountered that appears to be caused by a truncation error, then just decode the byte string without the bytes that cause the truncation error. |
Method | _read |
Read up to size bytes from the underlying stream, decode them using this reader's encoding, and return the resulting unicode string. linebuffer is not included in the result. |
Constant | _BOM |
Undocumented |
Instance Variable | _bom |
The length of the byte order marker at the beginning of the stream (or None for no byte order marker). |
Instance Variable | _rewind_checkpoint |
The file position at which the most recent read on the underlying stream began. This is used, together with _rewind_numchars, to backtrack to the beginning of linebuffer (which is required by tell()). |
Instance Variable | _rewind_numchars |
The number of characters that have been returned since the read that started at _rewind_checkpoint. This is used, together with _rewind_checkpoint, to backtrack to the beginning of linebuffer (which is required by ... |
Read up to size bytes, decode them using this reader's encoding, and return the resulting unicode string.
Parameters | |
size:int | The maximum number of bytes to read. If not specified, then read as many bytes as possible. |
Returns | |
unicode | Undocumented |
Read a line of text, decode it using this reader's encoding, and return the resulting unicode string.
Parameters | |
size:int | The maximum number of bytes to read. If no newline is encountered before size bytes have been read, then the returned value may not be a complete line of text. |
Read this file's contents, decode them using this reader's encoding, and return it as a list of unicode lines.
Parameters | |
sizehint | Ignored. |
keepends | If false, then strip newlines. |
Returns | |
list(unicode) | Undocumented |
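The effect of keepends can be illustrated on an already-decoded string with str.splitlines, which accepts the same kind of flag. This is only an illustration; whether the reader uses splitlines internally is an assumption, and the sample text is made up.

```python
# Sample decoded text; the last line has no trailing newline.
text = "first line\nsecond line\nthird"

# keepends=True: each line keeps its terminating newline, if it had one.
with_ends = text.splitlines(True)

# keepends=False: newlines are stripped from every line.
without_ends = text.splitlines(False)

print(with_ends)
print(without_ends)
```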
Move the stream to a new file position. If the reader is maintaining any buffers, then they will be cleared.
Parameters | |
offset | A byte count offset. |
whence | If 0, then the offset is from the start of the file (offset should be positive); if 1, then the offset is from the current position (offset may be positive or negative); and if 2, then the offset is from the end of the file (offset should typically be negative). |
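The three whence values follow the usual file-object convention. As a small sketch on a plain io.BytesIO (not on the reader itself, and with made-up data), byte offsets behave like this:

```python
import io

stream = io.BytesIO(b"0123456789")

# whence=0 (io.SEEK_SET): offset counted from the start of the file.
pos_from_start = stream.seek(3, io.SEEK_SET)

# whence=1 (io.SEEK_CUR): offset relative to the current position.
pos_from_current = stream.seek(2, io.SEEK_CUR)

# whence=2 (io.SEEK_END): offset relative to the end; usually negative.
pos_from_end = stream.seek(-4, io.SEEK_END)

remaining = stream.read()
print(pos_from_start, pos_from_current, pos_from_end, remaining)
```

Note that offset is always a byte count, so on a multi-byte encoding an arbitrary offset may land in the middle of a character.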
Return the current file position on the underlying byte stream. If this reader is maintaining any buffers, then the returned file position will be the position of the beginning of those buffers.
A buffer to hold bytes that have been read but have not yet been decoded. This is only used when the final bytes from a read do not form a complete encoding for a character.
The error mode that should be used when decoding data from the underlying stream. Can be 'strict', 'ignore', or 'replace'.
A buffer used by readline() to hold characters that have been read, but have not yet been returned by read() or readline(). This buffer consists of a list of unicode strings, where each string corresponds to a single line. The final element of the list may or may not be a complete line. Note that the existence of a linebuffer makes the tell() operation more complex, because it must backtrack to the beginning of the buffer to determine the correct file position in the underlying byte stream.
Move the file position forward by offset characters, ignoring all buffers.
Parameters | |
offset | Undocumented |
est | A hint, giving an estimate of the number of bytes that will be needed to move forward by offset chars. Defaults to offset. |
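One way to move forward by a number of characters using a byte estimate is sketched below. This is not the library's implementation: the function name char_seek_forward, its signature, and the doubling strategy are assumptions for illustration, and the sketch assumes a BOM-free UTF-8 stream so that re-encoding the decoded prefix gives its exact byte length.

```python
import codecs
import io


def char_seek_forward(stream, offset, encoding="utf-8", est=None):
    """Advance a binary stream by `offset` characters (hypothetical helper).

    Returns the new byte position. `est` is a guess at how many bytes
    are needed; it defaults to `offset` and is doubled until it suffices.
    """
    if est is None:
        est = offset
    start = stream.tell()
    while True:
        stream.seek(start)
        data = stream.read(est)
        # final=False: a partial character at the end is held back
        # by the decoder instead of raising an error.
        decoder = codecs.getincrementaldecoder(encoding)()
        chars = decoder.decode(data, final=False)
        at_eof = len(data) < est
        if len(chars) >= offset or at_eof:
            # Byte length of the characters we actually skip.
            consumed = len(chars[:offset].encode(encoding))
            stream.seek(start + consumed)
            return start + consumed
        # The estimate was too small: try again with more bytes.
        est = max(1, est * 2)


stream = io.BytesIO("héllo wörld".encode("utf-8"))
new_pos = char_seek_forward(stream, 2)  # skip "hé"
rest = stream.read().decode("utf-8")
print(new_pos, rest)
```

A better est saves re-reads when characters are wider than one byte, which is the purpose of the hint described above.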
Decode the given byte string into a unicode string, using this reader's encoding. If an exception is encountered that appears to be caused by a truncation error, then just decode the byte string without the bytes that cause the truncation error.
Return a tuple (chars, num_consumed), where chars is the decoded unicode string, and num_consumed is the number of bytes that were consumed.
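The truncation handling described above can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the helper name incr_decode is hypothetical, and a decode error is treated as a truncation only when it extends to the very end of the input.

```python
def incr_decode(data, encoding="utf-8", errors="strict"):
    """Decode `data`, tolerating a partial character at the end.

    Returns (chars, num_consumed). Hypothetical helper for illustration.
    """
    try:
        return data.decode(encoding, errors), len(data)
    except UnicodeDecodeError as exc:
        # An error that runs to the end of the input looks like a
        # character cut off by the read boundary: decode what precedes it.
        if exc.end == len(data):
            return data[:exc.start].decode(encoding, errors), exc.start
        # Anything else is a genuine decoding error.
        raise


# "café" in UTF-8 with the last byte of "é" cut off.
truncated = "café".encode("utf-8")[:-1]
chars, num_consumed = incr_decode(truncated)
print(repr(chars), num_consumed)
```

The unconsumed bytes (here, the first byte of "é") are what a reader would keep in a byte buffer until the next read supplies the rest of the character.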
Read up to size bytes from the underlying stream, decode them using this reader's encoding, and return the resulting unicode string. linebuffer is not included in the result.
The length of the byte order marker at the beginning of the stream (or None for no byte order marker).
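A byte order marker length of this kind could be computed as in the sketch below. The helper bom_length and the table _KNOWN_BOMS are hypothetical names, and how the class itself maps encodings to markers is not shown in this document; the sketch only uses the BOM constants from the standard codecs module.

```python
import codecs

# UTF-32 markers are listed before UTF-16 because the UTF-32-LE marker
# begins with the same two bytes as the UTF-16-LE marker.
_KNOWN_BOMS = (
    codecs.BOM_UTF32_LE,
    codecs.BOM_UTF32_BE,
    codecs.BOM_UTF8,
    codecs.BOM_UTF16_LE,
    codecs.BOM_UTF16_BE,
)


def bom_length(head):
    """Return the length of the BOM at the start of `head`, or None."""
    for bom in _KNOWN_BOMS:
        if head.startswith(bom):
            return len(bom)
    return None


utf8_len = bom_length(codecs.BOM_UTF8 + b"hi")
utf16_len = bom_length(codecs.BOM_UTF16_LE + "hi".encode("utf-16-le"))
no_bom = bom_length(b"hi")
print(utf8_len, utf16_len, no_bom)
```

A reader that knows this length can skip the marker when seeking to the start of the file, so that position 0 does not yield the marker as text.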