class documentation
class NKJPCorpus_Header_View(XMLCorpusView):
Constructor: NKJPCorpus_Header_View(filename, **kwargs)
Undocumented
Method | __init__ |
HEADER_MODE: A stream-backed corpus view specialized for use with header.xml files in the NKJP corpus. |
Method | handle |
Convert an element into an appropriate value for inclusion in the view. Unless overridden by a subclass or by the elt_handler constructor argument, this method simply returns elt. |
Method | handle |
Undocumented |
Instance Variable | tagspec |
Undocumented |
Inherited from XMLCorpusView:
Method | read |
Read from stream until we find at least one element that matches tagspec, and return the result of applying elt_handler to each element found. |
Method | _detect |
Undocumented |
Method | _read |
Read a string from the given stream that does not contain any un-closed tags. In particular, this function first reads a block from the stream of size self._BLOCK_SIZE. It then checks if that block contains an un-closed tag... |
Constant | _BLOCK |
Undocumented |
Constant | _DEBUG |
Undocumented |
Constant | _VALID |
Undocumented |
Constant | _XML |
Undocumented |
Constant | _XML |
Undocumented |
Instance Variable | _tag |
A dictionary mapping from file positions (as returned by stream.seek()) to XML contexts. An XML context is a tuple of XML tag names, indicating which tags have not yet been closed. |
Instance Variable | _tagspec |
The tag specification for this corpus view. |
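The notion of an XML context (a tuple of tag names that have been opened but not yet closed) can be illustrated with a minimal sketch. This is not the library's implementation: `update_context` and its regular expression are hypothetical names introduced here, and the sketch assumes simple markup with no `>` characters inside attribute values.

```python
import re

# Hypothetical pattern: matches opening, closing, and self-closing tags.
# Comments and processing instructions are skipped because a tag name
# must start with a letter or underscore.
_TAG = re.compile(r"<(/?)([A-Za-z_][\w.\-:]*)[^<>]*?(/?)>")

def update_context(fragment, context=()):
    """Return the tuple of still-open tag names after reading `fragment`,
    starting from the tags in `context`."""
    stack = list(context)
    for match in _TAG.finditer(fragment):
        closing, name, self_closing = match.groups()
        if closing:
            # A closing tag ends the innermost open element of that name.
            if stack and stack[-1] == name:
                stack.pop()
        elif not self_closing:
            # An opening tag adds one level to the context.
            stack.append(name)
    return tuple(stack)

first = update_context("<foo><bar><baz/>text")
second = update_context("</bar><qux>", first)
```

Storing such a tuple per file position lets a reader resume in the middle of a file and still know which elements enclose the next block.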
Inherited from StreamBackedCorpusView (via XMLCorpusView):
Method | __add__ |
Undocumented |
Method | __enter__ |
Undocumented |
Method | __exit__ |
Undocumented |
Method | __getitem__ |
Undocumented |
Method | __len__ |
Undocumented |
Method | __mul__ |
Undocumented |
Method | __radd__ |
Undocumented |
Method | __rmul__ |
Undocumented |
Method | close |
Close the file stream associated with this corpus view. This can be useful if you are worried about running out of file handles (although the stream should automatically be closed upon garbage collection of the corpus view)... |
Method | iterate |
Undocumented |
Class Variable | fileid |
Undocumented |
Method | _open |
Open the file stream associated with this corpus view. This will be performed if any value is read from the view while its file stream is closed. |
Instance Variable | _block |
The function used to read a single block from the underlying file stream. |
Instance Variable | _cache |
A cache of the most recently read block. It is encoded as a tuple (start_toknum, end_toknum, tokens), where start_toknum is the token index of the first token in the block; end_toknum is the token index of the first token not in the block; and tokens is a list of the tokens in the block. |
Instance Variable | _current |
This variable is set to the index of the next block that will be read, immediately before self.read_block() is called. This is provided for the benefit of the block reader, which under rare circumstances may need to know the current block number. |
Instance Variable | _current |
This variable is set to the index of the next token that will be read, immediately before self.read_block() is called. This is provided for the benefit of the block reader, which under rare circumstances may need to know the current token number. |
Instance Variable | _encoding |
Undocumented |
Instance Variable | _eofpos |
The character position of the last character in the file. This is calculated when the corpus view is initialized, and is used to decide when the end of file has been reached. |
Instance Variable | _fileid |
Undocumented |
Instance Variable | _filepos |
A list containing the file position of each block that has been processed. In particular, _filepos[i] is the file position of the first character in block i. Together with _toknum, this forms a partial mapping between token indices and file positions. |
Instance Variable | _len |
The total number of tokens in the corpus, if known; or None, if the number of tokens is not yet known. |
Instance Variable | _stream |
The stream used to access the underlying corpus file. |
Instance Variable | _toknum |
A list containing the token index of each block that has been processed. In particular, _toknum[i] is the token index of the first token in block i. Together with _filepos, this forms a partial mapping between token indices and file positions. |
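The partial mapping between token indices and file positions can be sketched with two parallel lists and a binary search. This is an illustration under stated assumptions, not the class's actual code: `locate_block` is a hypothetical helper, and the sample lists are made up.

```python
import bisect

def locate_block(toknum, filepos, tok_index):
    """Find the processed block that contains token `tok_index`.

    `toknum[i]` is the index of the first token in block i and
    `filepos[i]` is the file position where block i starts.
    Returns (block index, file position to seek to, first token in block).
    """
    # bisect_right gives the first block starting after tok_index,
    # so the block containing it is the one just before.
    block = bisect.bisect_right(toknum, tok_index) - 1
    return block, filepos[block], toknum[block]

# Made-up example: three blocks starting at tokens 0, 3 and 7.
toknum = [0, 3, 7]
filepos = [0, 120, 310]
block, position, first_token = locate_block(toknum, filepos, 5)
```

A reader would seek to `position` and skip `tok_index - first_token` tokens to reach the requested one. The mapping is partial because only blocks that have already been read have entries.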
Convert an element into an appropriate value for inclusion in the view. Unless overridden by a subclass or by the elt_handler constructor argument, this method simply returns elt.
Parameters | |
elt:ElementTree | The element that should be converted. |
context:str | A string composed of element tags separated by forward slashes, indicating the XML context of the given element. For example, the string 'foo/bar/baz' indicates that the element is a baz element whose parent is a bar element and whose grandparent is a top-level foo element. |
Returns | |
The view value corresponding to elt. |
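As a rough illustration of the parameters above, the following sketch shows a handler that uses both the element and its context string. `handle_title` is a hypothetical function written for this example, not part of the class; it only assumes that the element behaves like a standard `xml.etree.ElementTree` element and that the context has the slash-separated form described above.

```python
import xml.etree.ElementTree as ET

def handle_title(elt, context):
    """Hypothetical handler: return (parent tag, stripped text) for elt."""
    parts = context.split("/")
    # The last component names elt itself; the one before it is the parent.
    parent = parts[-2] if len(parts) > 1 else None
    return (parent, (elt.text or "").strip())

# Build a small tree matching the 'foo/bar/baz' example from the text.
root = ET.fromstring("<foo><bar><baz> hello </baz></bar></foo>")
baz = root.find("bar/baz")
result = handle_title(baz, "foo/bar/baz")
```

The default behaviour described above corresponds to a handler that ignores the context and returns the element unchanged.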