class ChunkString(object):
Constructor: ChunkString(chunk_struct, debug_level)
A string-based encoding of a particular chunking of a text. Internally, the ChunkString class uses a single string to encode the chunking of the input text. This string contains a sequence of angle-bracket delimited tags, with chunking indicated by braces. An example of this encoding is:
{<DT><JJ><NN>}<VBN><IN>{<DT><NN>}<.>{<DT><NN>}<VBD><.>
ChunkStrings are created from tagged texts (i.e., lists of tokens whose type is TaggedType). Initially, nothing is chunked.
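The encoding described above can be sketched without the library. This is a minimal standalone illustration, not the class's own code; the helper name `encode_tags` and the sample sentence are hypothetical.

```python
def encode_tags(tagged_tokens):
    """Return the unchunked encoding: one <TAG> per token, with no braces."""
    return "".join("<%s>" % tag for _word, tag in tagged_tokens)

# Hypothetical tagged sentence, used only for illustration.
tagged = [("the", "DT"), ("little", "JJ"), ("cat", "NN"), ("sat", "VBD"), (".", ".")]
unchunked = encode_tags(tagged)
print(unchunked)  # <DT><JJ><NN><VBD><.>
```

Only the tags appear in the string; the words are kept separately and matched back up by position when the chunk structure is rebuilt.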
The chunking of a ChunkString can be modified with the xform() method, which uses a regular expression to transform the string representation. These transformations should only add and remove braces; they should not modify the sequence of angle-bracket delimited tags.
Method | __init__ |
Construct a new ChunkString that encodes the chunking of the text in chunk_struct. |
Method | __repr__ |
Return a string representation of this ChunkString. It has the form: |
Method | __str__ |
Return a formatted representation of this ChunkString. This representation will include extra spaces to ensure that tags will line up with the representation of other ChunkStrings for the same text, regardless of the chunking. |
Method | to |
Return the chunk structure encoded by this ChunkString. |
Method | xform |
Apply the given transformation to the string encoding of this ChunkString. In particular, find all occurrences that match regexp, and replace them using repl (as done by re.sub). |
Constant | CHUNK |
Undocumented |
Constant | CHUNK |
Undocumented |
Constant | IN |
A zero-width regexp pattern string that will only match positions that are in chunks. |
Constant | IN |
A zero-width regexp pattern string that will only match positions that are in strips. |
Method | _tag |
Undocumented |
Method | _verify |
Check to make sure that s still corresponds to some chunked version of _pieces. |
Constant | _BALANCED |
Undocumented |
Constant | _BRACKETS |
Undocumented |
Constant | _CHUNK |
Undocumented |
Constant | _STRIP |
Undocumented |
Constant | _VALID |
Undocumented |
Instance Variable | _debug |
The debug level. See the constructor docs. |
Instance Variable | _pieces |
The tagged tokens and chunks encoded by this ChunkString. |
Instance Variable | _root |
Undocumented |
Instance Variable | _str |
The internal string representation of the text's encoding. This string representation contains a sequence of angle-bracket delimited tags, with chunking indicated by braces. An example of this encoding is:... |
Construct a new ChunkString that encodes the chunking of the text in chunk_struct.
Parameters | |
chunk_struct | The chunk structure to be further chunked. |
debug_level | The level of debugging which should be applied to transformations on the ChunkString. The valid levels are:
We recommend you use at least level 1. You should probably use level 3 if you use any non-standard subclasses of RegexpChunkRule. |
Return a string representation of this ChunkString. It has the form:
<ChunkString: '{<DT><JJ><NN>}<VBN><IN>{<DT><NN>}'>
Returns | |
str | Undocumented |
Return a formatted representation of this ChunkString. This representation will include extra spaces to ensure that tags will line up with the representation of other ChunkStrings for the same text, regardless of the chunking.
Returns | |
str | Undocumented |
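One way to obtain this alignment is to pad the string so that every brace position is occupied by either a brace or a space. The sketch below is standalone and illustrative; the function name `format_aligned` is hypothetical, and whether ChunkString pads in exactly this way is an assumption.

```python
import re

def format_aligned(encoded):
    """Pad a chunk string so tags fall in the same columns with or without braces."""
    # Add a space after every '>' that is not immediately followed by '}'.
    padded = re.sub(r">(?!\})", "> ", encoded)
    # Add a space before every '<' that is not immediately preceded by '{'.
    padded = re.sub(r"([^{])<", r"\1 <", padded)
    # A leading space stands in for a '{' that is absent at the start.
    if padded.startswith("<"):
        padded = " " + padded
    return padded

plain = format_aligned("<DT><NN><VBD>")
chunked = format_aligned("{<DT><NN>}<VBD>")
print(plain)
print(chunked)
```

Printed one above the other, the two results have the same length and each tag starts in the same column, so different chunkings of the same text can be compared visually.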
Return the chunk structure encoded by this ChunkString.
Returns | |
Tree | Undocumented |
Raises | |
ValueError | If a transformation has generated an invalid chunkstring. |
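The decoding step can be sketched as follows. This is a simplified standalone illustration: it returns nested Python lists rather than a Tree, and the function name `decode` and the sample data are hypothetical.

```python
import re

def decode(encoded, tagged_tokens):
    """Rebuild a nested structure: chunks become lists, other tokens stay as tuples."""
    # The encoding must be a flat run of <TAG> items and non-nested {...} groups.
    if not re.fullmatch(r"(?:\{(?:<[^<>{}]+>)+\}|<[^<>{}]+>)*", encoded):
        raise ValueError("invalid chunk string: %r" % encoded)
    result, index = [], 0
    for piece in re.findall(r"\{[^{}]*\}|<[^<>{}]+>", encoded):
        count = piece.count("<")  # number of tokens covered by this piece
        tokens = tagged_tokens[index:index + count]
        index += count
        if piece.startswith("{"):
            result.append(list(tokens))  # a chunk
        else:
            result.extend(tokens)        # an unchunked token
    if index != len(tagged_tokens):
        raise ValueError("chunk string does not cover every token")
    return result

# Hypothetical tagged text, used only for illustration.
tagged = [("the", "DT"), ("cat", "NN"), ("sat", "VBD")]
structure = decode("{<DT><NN>}<VBD>", tagged)
print(structure)  # [[('the', 'DT'), ('cat', 'NN')], ('sat', 'VBD')]
```

A string with nested or unbalanced braces fails the first check and raises ValueError, mirroring the documented behaviour for invalid chunkstrings.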
Apply the given transformation to the string encoding of this ChunkString. In particular, find all occurrences that match regexp, and replace them using repl (as done by re.sub).
This transformation should only add and remove braces; it should not modify the sequence of angle-bracket delimited tags. Furthermore, the transformation must not result in improper bracketing. Note, in particular, that brackets must not be nested.
Parameters | |
regexp:str or regexp | A regular expression matching the substring that should be replaced. This will typically include a named group, which can be used by repl. |
repl:str | An expression specifying what should replace the matched substring. Typically, this will include a named replacement group, specified by regexp. |
Returns | |
None | Undocumented |
Raises | |
ValueError | If this transformation generated an invalid chunkstring. |
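The kind of substitution xform performs can be illustrated with re.sub directly on an encoded string. This sketch does not call ChunkString itself; the pattern, the group name `chunk`, and the sample string are chosen for illustration only.

```python
import re

encoded = "<DT><JJ><NN><VBN><IN><DT><NN><.>"

# Wrap every optional-determiner, any-adjectives, noun sequence in braces.
regexp = re.compile(r"(?P<chunk>(?:<DT>)?(?:<JJ>)*<NN>)")
repl = r"{\g<chunk>}"
transformed = regexp.sub(repl, encoded)
print(transformed)  # {<DT><JJ><NN>}<VBN><IN>{<DT><NN>}<.>

# The tag sequence itself must be unchanged once braces are ignored.
assert re.sub(r"[{}]", "", transformed) == encoded
```

The named group lets the replacement reinsert the matched tags verbatim, so the transformation only adds braces around them.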
A zero-width regexp pattern string that will only match positions that are in chunks.
Value |
|
A zero-width regexp pattern string that will only match positions that are in strips.
Value |
|
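Zero-width patterns with these two properties can be written as lookahead assertions. The sketch below shows one possible pair; the names `IN_CHUNK` and `IN_STRIP` and the pattern values are assumptions made for illustration, not necessarily the values of the constants documented here.

```python
import re

# Assumed lookahead patterns (hypothetical names and values).
IN_CHUNK = r"(?=[^{]*\})"        # the next brace ahead is a closing one
IN_STRIP = r"(?=[^}]*(?:\{|$))"  # the next brace ahead is an opening one, or there is none

encoded = "{<DT><NN>}<VBD>{<NN>}<.>"
tag = r"<[^<>]+>"

# A tag is in a chunk (or strip) if the position right after it satisfies the lookahead.
in_chunks = re.findall(tag + IN_CHUNK, encoded)
in_strips = re.findall(tag + IN_STRIP, encoded)
print(in_chunks)  # ['<DT>', '<NN>', '<NN>']
print(in_strips)  # ['<VBD>', '<.>']
```

Because the assertions consume no characters, they can be appended to a tag pattern to restrict a rule to tags inside or outside existing chunks without altering what the rule matches.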
Check to make sure that s still corresponds to some chunked version of _pieces.
Parameters | |
s | Undocumented |
verify | Whether the individual tags should be checked. If this is false, _verify will check to make sure that _str encodes a chunked version of some list of tokens. If this is true, then _verify will check to make sure that the tags in _str match those in _pieces. |
Raises | |
ValueError | If the internal string representation of this ChunkString is invalid or not consistent with _pieces. |
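The two levels of checking described above can be sketched as a standalone function. This is an illustrative approximation, not the method's actual code; the names `verify`, `TAG`, `VALID`, and `expected_tags` are hypothetical, with `expected_tags=None` standing in for the case where individual tags are not checked.

```python
import re

TAG = r"<[^<>{}]+>"
# Only tags are allowed, each optionally preceded or followed by a brace.
VALID = re.compile(r"(?:\{?%s\}?)*" % TAG)

def verify(s, expected_tags=None):
    """Raise ValueError unless s is a well-formed, non-nested chunk string."""
    if not VALID.fullmatch(s):
        raise ValueError("not a sequence of tags and braces: %r" % s)
    # With tags removed, the braces must be a run of '{}' pairs (balanced, not nested).
    braces = re.sub(r"[^{}]+", "", s)
    if not re.fullmatch(r"(?:\{\})*", braces):
        raise ValueError("unbalanced or nested braces: %r" % s)
    # Optional stricter check: the tags must match the expected sequence.
    if expected_tags is not None:
        found = re.findall(r"<([^<>{}]+)>", s)
        if found != list(expected_tags):
            raise ValueError("tags were modified: %r" % s)

verify("{<DT><NN>}<VBD>")                      # cursory check passes
verify("{<DT><NN>}<VBD>", ["DT", "NN", "VBD"])  # full check passes
```

The cursory check is cheap and catches bracketing mistakes; the full check additionally detects a transformation that altered, dropped, or reordered tags.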