class documentation

A stream-backed corpus view specialized for use with text.xml files in the NKJP corpus.

Method __init__ Create a new corpus view based on a specified XML file.
Method get_segm_id Undocumented
Method handle_elt Convert an element into an appropriate value for inclusion in the view. Unless overridden by a subclass or by the elt_handler constructor argument, this method simply returns elt.
Method handle_query Undocumented
Method read_block Returns text as a list of sentences.
Constant RAW_MODE Undocumented
Constant SENTS_MODE Undocumented
Instance Variable mode Undocumented
Instance Variable segm_dict Undocumented
Instance Variable tagspec Undocumented
Instance Variable xml_tool Undocumented

Inherited from XMLCorpusView:

Method _detect_encoding Undocumented
Method _read_xml_fragment Read a string from the given stream that does not contain any un-closed tags. In particular, this function first reads a block from the stream of size self._BLOCK_SIZE. It then checks if that block contains an un-closed tag...
Constant _BLOCK_SIZE Undocumented
Constant _DEBUG Undocumented
Constant _VALID_XML_RE Undocumented
Constant _XML_PIECE Undocumented
Constant _XML_TAG_NAME Undocumented
Instance Variable _tag_context A dictionary mapping from file positions (as returned by stream.seek()) to XML contexts. An XML context is a tuple of XML tag names, indicating which tags have not yet been closed.
Instance Variable _tagspec The tag specification for this corpus view.
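
The block-reading idea described for _read_xml_fragment can be illustrated with a minimal sketch. It is not the library's implementation: the function name read_xml_fragment, the tiny block size, and the "last '<' after last '>'" test are assumptions made for illustration, and the sketch ignores comments, CDATA sections, and '>' characters inside attribute values.

```python
import io


def read_xml_fragment(stream, block_size=16):
    """Read from `stream` until the text read so far does not end inside a tag.

    `block_size` stands in for self._BLOCK_SIZE (hypothetical value).
    """
    fragment = ""
    while True:
        block = stream.read(block_size)
        fragment += block
        # A '<' that comes after the last '>' means a tag is still open.
        last_open = fragment.rfind("<")
        last_close = fragment.rfind(">")
        if last_open <= last_close or not block:
            return fragment


text = "<a>" + "x" * 20 + '<b attr="1">y</b></a>'
stream = io.StringIO(text)
first = read_xml_fragment(stream)   # stops at a block boundary outside any tag
second = read_xml_fragment(stream)  # keeps reading until <b ...> and </a> are complete
```

Each returned fragment ends outside a tag, so a caller can search it for complete elements before asking for more input.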

Inherited from StreamBackedCorpusView (via XMLCorpusView):

Method __add__ Undocumented
Method __enter__ Undocumented
Method __exit__ Undocumented
Method __getitem__ Undocumented
Method __len__ Undocumented
Method __mul__ Undocumented
Method __radd__ Undocumented
Method __rmul__ Undocumented
Method close Close the file stream associated with this corpus view. This can be useful if you are worried about running out of file handles (although the stream should automatically be closed upon garbage collection of the corpus view)...
Method iterate_from Undocumented
Class Variable fileid Undocumented
Method _open Open the file stream associated with this corpus view. This will be performed if any value is read from the view while its file stream is closed.
Instance Variable _block_reader The function used to read a single block from the underlying file stream.
Instance Variable _cache A cache of the most recently read block. It is encoded as a tuple (start_toknum, end_toknum, tokens), where start_toknum is the token index of the first token in the block; end_toknum is the token index of the first token not in the block; and tokens is a list of the tokens in the block.
Instance Variable _current_blocknum This variable is set to the index of the next block that will be read, immediately before self.read_block() is called. This is provided for the benefit of the block reader, which under rare circumstances may need to know the current block number.
Instance Variable _current_toknum This variable is set to the index of the next token that will be read, immediately before self.read_block() is called. This is provided for the benefit of the block reader, which under rare circumstances may need to know the current token number.
Instance Variable _encoding Undocumented
Instance Variable _eofpos The character position of the last character in the file. This is calculated when the corpus view is initialized, and is used to decide when the end of file has been reached.
Instance Variable _fileid Undocumented
Instance Variable _filepos A list containing the file position of each block that has been processed. In particular, _filepos[i] is the file position of the first character in block i. Together with _toknum, this forms a partial mapping between token indices and file positions.
Instance Variable _len The total number of tokens in the corpus, if known; or None, if the number of tokens is not yet known.
Instance Variable _stream The stream used to access the underlying corpus file.
Instance Variable _toknum A list containing the token index of each block that has been processed. In particular, _toknum[i] is the token index of the first token in block i. Together with _filepos, this forms a partial mapping between token indices and file positions.
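
How _toknum and _filepos form a partial mapping between token indices and file positions can be sketched as follows. This is an illustration under assumptions, not the view's actual lookup code: the lists toknum and filepos hold made-up values, and block_start_for is a hypothetical helper.

```python
import bisect

# Hypothetical bookkeeping for three blocks that have already been read.
toknum = [0, 4, 9]       # token index of the first token in each block
filepos = [0, 120, 310]  # file position of the first character in each block


def block_start_for(token_index):
    """Return (block number, file position, first token index) for the
    last known block that starts at or before `token_index`."""
    block = bisect.bisect_right(toknum, token_index) - 1
    return block, filepos[block], toknum[block]


start = block_start_for(5)  # token 5 lies in block 1, which starts at token 4
```

The mapping is partial because only processed blocks are recorded. For a token past the last known block, the lookup returns that last block, and a reader would continue forward from its file position.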
def __init__(self, filename, **kwargs): (source)

Create a new corpus view based on a specified XML file.

Note that the XMLCorpusView constructor does not take an encoding argument, because the Unicode encoding is specified by the XML files themselves.

Parameters
filename Undocumented
tagspec: str A tag specification, indicating what XML elements should be included in the view. Each non-nested element that matches this specification corresponds to one item in the view.
elt_handler A function used to transform each element to a value for the view. If no handler is specified, then self.handle_elt() is called, which returns the element as an ElementTree object. The signature of elt_handler is:

elt_handler(elt, tagspec) -> value
**kwargs Undocumented
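
A handler matching the two-argument signature above might look like the following sketch. The name text_handler, the sample element, and the sample second argument are hypothetical; the example only shows the shape of such a function and does not construct a corpus view.

```python
import xml.etree.ElementTree as ET


def text_handler(elt, tagspec):
    """Hypothetical handler: return the element's text content with
    runs of whitespace collapsed to single spaces."""
    return " ".join("".join(elt.itertext()).split())


# A made-up element standing in for one matched by the tag specification.
elt = ET.fromstring("<s>Ala  <w>ma</w>\n kota</s>")
value = text_handler(elt, "text/body/s")
```

Returning a plain string rather than the element keeps the items in the view lightweight, at the cost of discarding attributes and nested structure.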
def get_segm_id(self, elt): (source)

Undocumented

def handle_elt(self, elt, context): (source)

Convert an element into an appropriate value for inclusion in the view. Unless overridden by a subclass or by the elt_handler constructor argument, this method simply returns elt.

Parameters
elt: ElementTree The element that should be converted.
context: str A string composed of element tags separated by forward slashes, indicating the XML context of the given element. For example, the string 'foo/bar/baz' indicates that the element is a baz element whose parent is a bar element and whose grandparent is a top-level foo element.
Returns
The view value corresponding to elt.
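
The slash-separated context string can be unpacked with ordinary string operations. The helper below, describe_context, is a hypothetical illustration and not part of the class.

```python
def describe_context(context):
    """Split a context string such as 'foo/bar/baz' into its parts."""
    tags = context.split("/")
    return {
        "tag": tags[-1],                              # the element itself
        "parent": tags[-2] if len(tags) > 1 else None,
        "depth": len(tags),                           # 1 for a top-level element
    }


info = describe_context("foo/bar/baz")
```

A handle_elt override could use this kind of inspection to treat elements differently depending on where they appear in the document.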
def handle_query(self): (source)

Undocumented

def read_block(self, stream, tagspec=None, elt_handler=None): (source)

Returns text as a list of sentences.

RAW_MODE: int = (source)

Undocumented

Value
1
SENTS_MODE: int = (source)

Undocumented

Value
0
mode = (source)

Undocumented

segm_dict = (source)

Undocumented

tagspec: str = (source)

Undocumented

xml_tool = (source)

Undocumented