class documentation

class ChunkString(object): (source)

Constructor: ChunkString(chunk_struct, debug_level)

A string-based encoding of a particular chunking of a text. Internally, the ChunkString class uses a single string to encode the chunking of the input text. This string contains a sequence of angle-bracket delimited tags, with chunking indicated by braces. An example of this encoding is:

{<DT><JJ><NN>}<VBN><IN>{<DT><NN>}<.>{<DT><NN>}<VBD><.>

ChunkStrings are created from tagged texts (i.e., lists of tokens whose type is TaggedType). Initially, nothing is chunked.

The chunking of a ChunkString can be modified with the xform() method, which uses a regular expression to transform the string representation. These transformations should only add and remove braces; they should not modify the sequence of angle-bracket delimited tags.
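As an illustration of the encoding described above, the following minimal sketch builds the initial, unchunked string from a list of (word, tag) pairs using only the standard library. The helper names encode_tags and strip_braces and the sample sentence are hypothetical, chosen for this example; they are not part of the ChunkString API.

```python
import re

def encode_tags(tagged_tokens):
    """Return the unchunked encoding: one <TAG> per token, no braces."""
    return "".join("<%s>" % tag for _word, tag in tagged_tokens)

def strip_braces(encoding):
    """Remove chunk braces, leaving only the tag sequence."""
    return re.sub(r"[{}]", "", encoding)

# Illustrative sentence (not taken from the documentation).
tokens = [("the", "DT"), ("little", "JJ"), ("cat", "NN"),
          ("sat", "VBD"), ("on", "IN"), ("the", "DT"), ("mat", "NN")]

unchunked = encode_tags(tokens)   # nothing is chunked initially
chunked = "{<DT><JJ><NN>}<VBD><IN>{<DT><NN>}"

# A valid chunking differs from the unchunked string only in its braces.
same_tags = strip_braces(chunked) == unchunked
print(unchunked)
print(same_tags)
```

Comparing the brace-free forms is a quick way to see whether a transformation has left the tag sequence intact.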

Method __init__ Construct a new ChunkString that encodes the chunking of the text in chunk_struct.
Method __repr__ Return a string representation of this ChunkString.
Method __str__ Return a formatted representation of this ChunkString. This representation will include extra spaces to ensure that tags will line up with the representation of other ChunkStrings for the same text, regardless of the chunking.
Method to_chunkstruct Return the chunk structure encoded by this ChunkString.
Method xform Apply the given transformation to the string encoding of this ChunkString. In particular, find all occurrences that match regexp, and replace them using repl (as done by re.sub).
Constant CHUNK_TAG Undocumented
Constant CHUNK_TAG_CHAR Undocumented
Constant IN_CHUNK_PATTERN A zero-width regexp pattern string that will only match positions that are in chunks.
Constant IN_STRIP_PATTERN A zero-width regexp pattern string that will only match positions that are in strips.
Method _tag Undocumented
Method _verify Check to make sure that s still corresponds to some chunked version of _pieces.
Constant _BALANCED_BRACKETS Undocumented
Constant _BRACKETS Undocumented
Constant _CHUNK Undocumented
Constant _STRIP Undocumented
Constant _VALID Undocumented
Instance Variable _debug The debug level. See the constructor docs.
Instance Variable _pieces The tagged tokens and chunks encoded by this ChunkString.
Instance Variable _root_label Undocumented
Instance Variable _str The internal string representation of the text's encoding. This string representation contains a sequence of angle-bracket delimited tags, with chunking indicated by braces. An example of this encoding is:...
def __init__(self, chunk_struct, debug_level=1): (source)

Construct a new ChunkString that encodes the chunking of the text in chunk_struct.

Parameters
chunk_struct (Tree): The chunk structure to be further chunked.
debug_level (int):

The level of debugging which should be applied to transformations on the ChunkString. The valid levels are:

  • 0: no checks
  • 1: full check on to_chunkstruct
  • 2: full check on to_chunkstruct and cursory check after
    each transformation.
  • 3: full check on to_chunkstruct and full check after
    each transformation.

We recommend you use at least level 1. You should probably use level 3 if you use any non-standard subclasses of RegexpChunkRule.

def __repr__(self): (source)

Return a string representation of this ChunkString. It has the form:

<ChunkString: '{<DT><JJ><NN>}<VBN><IN>{<DT><NN>}'>
Returns
str: Undocumented
def __str__(self): (source)

Return a formatted representation of this ChunkString. This representation will include extra spaces to ensure that tags will line up with the representation of other ChunkStrings for the same text, regardless of the chunking.

Returns
str: Undocumented
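The alignment described above can be pictured with a small sketch: every position where a brace could appear gets a space when no brace is there, so tags occupy the same columns in every chunking of the same text. The function name aligned is hypothetical, and this is one possible way to produce such padding rather than a statement of the library's exact code.

```python
import re

def aligned(encoding):
    """Pad an encoding so each tag starts at the same column
    whether or not braces surround it."""
    s = re.sub(r">(?!\})", "> ", encoding)   # slot for a closing brace
    s = re.sub(r"([^{])<", r"\1 <", s)       # slot for an opening brace
    if s.startswith("<"):
        s = " " + s                          # slot for a leading brace
    return s

plain = aligned("<DT><NN><VBD>")
chunked = aligned("{<DT><NN>}<VBD>")
print(repr(plain))
print(repr(chunked))
```

Printing the two results on consecutive lines shows the tags stacked in the same columns, with braces filling slots that are blank in the unchunked version.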
def to_chunkstruct(self, chunk_label='CHUNK'): (source)

Return the chunk structure encoded by this ChunkString.

Returns
Tree: Undocumented
Raises
ValueError: If a transformation has generated an invalid chunkstring.
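To make the decoding step concrete, here is a minimal sketch that turns a brace encoding back into a nested structure. It assumes plain Python lists and (label, tokens) tuples as a stand-in for Tree, and the function name decode is hypothetical; the real method returns a Tree and may differ in detail.

```python
import re

def decode(encoding, tokens, chunk_label="CHUNK"):
    """Rebuild a nested structure from a brace encoding.

    Chunks become (chunk_label, [tokens]) tuples; tokens outside any
    chunk stay at the top level. Plain lists stand in for Tree here.
    """
    # After deleting non-brace characters, only '{}' pairs may remain.
    if re.sub(r"[^{}]", "", encoding).replace("{}", "") != "":
        raise ValueError("unbalanced or nested braces: %r" % encoding)
    result, index, in_chunk = [], 0, False
    # Splitting on braces yields alternating outside/inside segments.
    for segment in re.split(r"[{}]", encoding):
        count = segment.count("<")           # number of tags in segment
        span = tokens[index:index + count]
        if in_chunk:
            result.append((chunk_label, span))
        else:
            result.extend(span)
        index += count
        in_chunk = not in_chunk
    return result

tokens = [("the", "DT"), ("cat", "NN"), ("sat", "VBD")]
structure = decode("{<DT><NN>}<VBD>", tokens)
print(structure)
```

The brace check at the top plays the role of the ValueError documented above: an encoding with nested or unmatched braces is rejected before any structure is built.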
def xform(self, regexp, repl): (source)

Apply the given transformation to the string encoding of this ChunkString. In particular, find all occurrences that match regexp, and replace them using repl (as done by re.sub).

This transformation should only add and remove braces; it should not modify the sequence of angle-bracket delimited tags. Furthermore, this transformation may not result in improper bracketing. Note, in particular, that bracketing may not be nested.

Parameters
regexp (str or regexp): A regular expression matching the substring that should be replaced. This will typically include a named group, which can be used by repl.
repl (str): An expression specifying what should replace the matched substring. Typically, this will include a named replacement group, specified by regexp.
Returns
None: Undocumented
Raises
ValueError: If this transformation generated an invalid chunkstring.
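The following sketch shows what a brace-adding transformation with a named group might look like when applied directly with re.sub. The function apply_xform, the tag pattern, and the sample encoding are assumptions made for illustration; only the IN_STRIP_PATTERN value is taken from this page.

```python
import re

IN_STRIP_PATTERN = r"(?=[^\}]*(\{|$))"   # value documented on this page

def apply_xform(encoding, regexp, repl):
    """Apply one regexp transformation; reject it if it altered the tags."""
    result = re.sub(regexp, repl, encoding)
    if re.sub(r"[{}]", "", result) != re.sub(r"[{}]", "", encoding):
        raise ValueError("transformation modified the tag sequence")
    return result

# Named group 'chunk': an optional determiner, any adjectives, and a noun.
# The trailing lookahead restricts matches to text not already in a chunk.
regexp = r"(?P<chunk>(<DT>)?(<JJ>)*<NN>)" + IN_STRIP_PATTERN
repl = r"{\g<chunk>}"

once = apply_xform("<DT><JJ><NN><VBD><IN><DT><NN>", regexp, repl)
twice = apply_xform(once, regexp, repl)   # no new braces the second time
print(once)
```

Appending the lookahead keeps a repeated application from wrapping an existing chunk in a second pair of braces, which would produce the nested bracketing that the description above forbids.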
CHUNK_TAG = (source)

Undocumented

Value
'(<%s+?>)' % CHUNK_TAG_CHAR
CHUNK_TAG_CHAR: str = (source)

Undocumented

Value
'[^\\{\\}<>]'
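As a quick check of what these two values match, the sketch below rebuilds them with the standard re module and extracts every tag from the example encoding used on this page. It assumes nothing beyond the values shown above.

```python
import re

CHUNK_TAG_CHAR = r"[^\{\}<>]"              # any character except { } < >
CHUNK_TAG = r"(<%s+?>)" % CHUNK_TAG_CHAR   # one angle-bracket delimited tag

tags = re.findall(CHUNK_TAG, "{<DT><JJ><NN>}<VBN><IN>{<DT><NN>}<.>")
print(tags)
```

Because CHUNK_TAG contains a single capturing group, re.findall returns the text of that group for each match, i.e. each tag including its angle brackets.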
IN_CHUNK_PATTERN: str = (source)

A zero-width regexp pattern string that will only match positions that are in chunks.

Value
'(?=[^\\{]*\\})'
IN_STRIP_PATTERN: str = (source)

A zero-width regexp pattern string that will only match positions that are in strips.

Value
'(?=[^\\}]*(\\{|$))'
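The two lookahead patterns can be tried out directly with re. In this sketch the TAG pattern and the sample string are assumptions for illustration; the two pattern values are copied from above. A tag followed by IN_CHUNK_PATTERN matches only inside braces, and a tag followed by IN_STRIP_PATTERN matches only outside them.

```python
import re

IN_CHUNK_PATTERN = r"(?=[^\{]*\})"
IN_STRIP_PATTERN = r"(?=[^\}]*(\{|$))"
TAG = r"<[^\{\}<>]+?>"                     # hypothetical single-tag pattern

s = "{<DT><NN>}<VBD>{<DT><NN>}<.>"

# finditer is used because IN_STRIP_PATTERN contains a capturing group,
# which would change what re.findall returns.
in_chunks = [m.group(0) for m in re.finditer(TAG + IN_CHUNK_PATTERN, s)]
in_strips = [m.group(0) for m in re.finditer(TAG + IN_STRIP_PATTERN, s)]
print(in_chunks)
print(in_strips)
```

Both patterns rely on the encoding being well formed: they look ahead for the next brace and infer the current position from whether it is an opening or a closing one.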
def _tag(self, tok): (source)

Undocumented

def _verify(self, s, verify_tags): (source)

Check to make sure that s still corresponds to some chunked version of _pieces.

Parameters
s: Undocumented
verify_tags (bool): Whether the individual tags should be checked. If this is false, _verify will check to make sure that _str encodes a chunked version of some list of tokens. If this is true, then _verify will check to make sure that the tags in _str match those in _pieces.
Raises
ValueError: If the internal string representation of this ChunkString is invalid or not consistent with _pieces.
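One way to picture this check is the standalone sketch below, which combines the values documented on this page for _VALID, _BRACKETS, and _BALANCED_BRACKETS. The function verify and its expected_tags parameter are hypothetical stand-ins (for _verify and the comparison against _pieces); the actual method may perform its checks differently.

```python
import re

CHUNK_TAG = r"(<[^\{\}<>]+?>)"
_VALID = re.compile(r"^(\{?%s\}?)*?$" % CHUNK_TAG)
_BRACKETS = re.compile(r"[^\{\}]+")
_BALANCED_BRACKETS = re.compile(r"(\{\})*$")

def verify(s, expected_tags=None):
    """Raise ValueError if s is not a well-formed chunk encoding."""
    # Overall form: a sequence of tags, each optionally touching a brace.
    if not _VALID.match(s):
        raise ValueError("invalid chunkstring: %r" % s)
    # Drop everything but braces; what remains must be '{}' repeated.
    if not _BALANCED_BRACKETS.match(_BRACKETS.sub("", s)):
        raise ValueError("unbalanced or nested braces: %r" % s)
    # Optional tag check, analogous to verify_tags being true.
    if expected_tags is not None:
        found = re.split(r"[\{\}<>]+", s)[1:-1]
        if found != list(expected_tags):
            raise ValueError("tag mismatch: %r" % found)

verify("{<DT><NN>}<VBD>", ["DT", "NN", "VBD"])   # passes silently
```

Note that _VALID alone accepts a nested encoding such as {<DT>{<NN>}<VBD>}; it is the brace-only check that rejects it.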
_BALANCED_BRACKETS = (source)

Undocumented

Value
re.compile(r'(\{\})*$')
_BRACKETS = (source)

Undocumented

Value
re.compile(r'[^\{\}]+')
_CHUNK = (source)

Undocumented

Value
'(\\{%s+?\\})+?' % CHUNK_TAG
_STRIP = (source)

Undocumented

Value
'(%s+?)+?' % CHUNK_TAG
_VALID = (source)

Undocumented

Value
re.compile(('^(\\{?%s\\}?)*?$' % CHUNK_TAG))
_debug = (source)

The debug level. See the constructor docs.

_pieces: list(tagged tokens and chunks) = (source)

The tagged tokens and chunks encoded by this ChunkString.

_root_label = (source)

Undocumented

_str: str = (source)

The internal string representation of the text's encoding. This string representation contains a sequence of angle-bracket delimited tags, with chunking indicated by braces. An example of this encoding is:

{<DT><JJ><NN>}<VBN><IN>{<DT><NN>}<.>{<DT><NN>}<VBD><.>