nltk.chunk.regexp.ChunkString

class documentation

class ChunkString(object):

Constructor: ChunkString(chunk_struct, debug_level)

A string-based encoding of a particular chunking of a text. Internally, the ChunkString class uses a single string to encode the chunking of the input text. This string contains a sequence of angle-bracket delimited tags, with chunking indicated by braces. An example of this encoding is:

{<DT><JJ><NN>}<VBN><IN>{<DT><NN>}<.>{<DT><NN>}<VBD><.>

A ChunkString is created from a tagged text (i.e., a list of tokens whose type is TaggedType). Initially, nothing is chunked.

The chunking of a ChunkString can be modified with the xform() method, which uses a regular expression to transform the string representation. These transformations should only add and remove braces; they should not modify the sequence of angle-bracket delimited tags.
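As a minimal sketch of such a brace-only transformation, the snippet below applies `re.sub` directly to an encoding string rather than going through `xform()`. The tag sequence, the rule, and the helper `strip_braces` are illustrative assumptions, not part of NLTK.

```python
import re

# Unchunked encoding of a hypothetical tagged sentence.
encoded = "<DT><JJ><NN><VBN><IN><DT><NN><.>"

# Wrap a determiner, any adjectives, and a noun in braces.
chunked = re.sub(r"(<DT>(?:<JJ>)*<NN>)", r"{\1}", encoded)


def strip_braces(s):
    """Remove chunk braces, leaving only the tag sequence."""
    return s.replace("{", "").replace("}", "")


print(chunked)
# The tag sequence itself must survive the transformation unchanged.
print(strip_braces(chunked) == encoded)
```

Comparing the brace-stripped result with the input is a cheap way to confirm that a rule only added or removed braces.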

Method	`__init__`	Construct a new `ChunkString` that encodes the chunking of the text in `chunk_struct`.
Method	`__repr__`	Return a string representation of this `ChunkString`, of the form `<ChunkString: '...'>`.
Method	`__str__`	Return a formatted representation of this `ChunkString`. This representation will include extra spaces to ensure that tags will line up with the representation of other `ChunkStrings` for the same text, regardless of the chunking.
Method	`to_chunkstruct`	Return the chunk structure encoded by this `ChunkString`.
Method	`xform`	Apply the given transformation to the string encoding of this `ChunkString`. In particular, find all occurrences that match `regexp`, and replace them using `repl` (as done by `re.sub`).
Constant	`CHUNK_TAG`	Undocumented
Constant	`CHUNK_TAG_CHAR`	Undocumented
Constant	`IN_CHUNK_PATTERN`	A zero-width regexp pattern string that will only match positions that are in chunks.
Constant	`IN_STRIP_PATTERN`	A zero-width regexp pattern string that will only match positions that are in strips.
Method	`_tag`	Undocumented
Method	`_verify`	Check to make sure that `s` still corresponds to some chunked version of `_pieces`.
Constant	`_BALANCED_BRACKETS`	Undocumented
Constant	`_BRACKETS`	Undocumented
Constant	`_CHUNK`	Undocumented
Constant	`_STRIP`	Undocumented
Constant	`_VALID`	Undocumented
Instance Variable	`_debug`	The debug level. See the constructor docs.
Instance Variable	`_pieces`	The tagged tokens and chunks encoded by this `ChunkString`.
Instance Variable	`_root_label`	Undocumented
Instance Variable	`_str`	The internal string representation of the text's encoding. This string representation contains a sequence of angle-bracket delimited tags, with chunking indicated by braces.

def __init__(self, chunk_struct, debug_level=1):

Construct a new ChunkString that encodes the chunking of the text in chunk_struct.

Parameters

chunk_struct:Tree The chunk structure to be further chunked.

debug_level:int

The level of debugging which should be applied to transformations on the ChunkString. The valid levels are:

0: no checks

1: full check on to_chunkstruct

2: full check on to_chunkstruct and cursory check after each transformation.

3: full check on to_chunkstruct and full check after each transformation.

We recommend you use at least level 1. You should probably use level 3 if you use any non-standard subclasses of RegexpChunkRule.

def __repr__(self):

Return a string representation of this ChunkString. It has the form:

<ChunkString: '{<DT><JJ><NN>}<VBN><IN>{<DT><NN>}'>

Returns
str	Undocumented

def __str__(self):

Return a formatted representation of this ChunkString. This representation will include extra spaces to ensure that tags will line up with the representation of other ChunkStrings for the same text, regardless of the chunking.
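The sketch below shows one spacing scheme that produces this kind of alignment: every tag gets one character slot on each side, filled by a brace when one is present and by a space otherwise. The function name `aligned` is hypothetical, and the library's own formatting code may differ in detail.

```python
import re


def aligned(encoded):
    """Pad an encoding so tags occupy the same columns for any chunking."""
    # Add a space after each tag that is not closed by a brace.
    s = re.sub(r">(?!\})", r"> ", encoded)
    # Add a space before each tag that is not opened by a brace.
    s = re.sub(r"([^\{])<", r"\1 <", s)
    # The very first tag has no preceding character, so pad it by hand.
    if s.startswith("<"):
        s = " " + s
    return s


chunked = aligned("{<DT><NN>}<VBD>")
unchunked = aligned("<DT><NN><VBD>")
print(chunked)
print(unchunked)
```

Printing the two results one above the other shows each tag starting in the same column in both lines.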

Returns
str	Undocumented

def to_chunkstruct(self, chunk_label='CHUNK'):

Return the chunk structure encoded by this ChunkString.

Returns
Tree	Undocumented
Raises
`ValueError`	If a transformation has generated an invalid chunkstring.
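To illustrate how an encoding string maps back to a chunk structure, here is a rough sketch that decodes into plain Python lists and tuples instead of an NLTK `Tree`. The function `decode`, its arguments, and the sample tokens are assumptions made for this example, and it presumes the braces are balanced and not nested.

```python
import re


def decode(encoded, tokens, chunk_label="CHUNK"):
    """Group tagged tokens according to the braces in `encoded`."""
    result = []
    index = 0
    in_chunk = False
    # Splitting on braces yields alternating strip and chunk pieces.
    for piece in re.split(r"[{}]", encoded):
        count = piece.count("<")  # number of tags in this piece
        span = tokens[index:index + count]
        if in_chunk:
            result.append((chunk_label, span))
        else:
            result.extend(span)
        index += count
        in_chunk = not in_chunk
    return result


tokens = [("the", "DT"), ("cat", "NN"), ("chased", "VBD"), ("mice", "NN")]
structure = decode("{<DT><NN>}<VBD>{<NN>}", tokens)
print(structure)
```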

def xform(self, regexp, repl):

Apply the given transformation to the string encoding of this ChunkString. In particular, find all occurrences that match regexp, and replace them using repl (as done by re.sub).

This transformation should only add and remove braces; it should not modify the sequence of angle-bracket delimited tags. Furthermore, the transformation must not produce improper bracketing. In particular, brackets must not be nested.

Parameters
regexp:str or regexp	A regular expression matching the substring that should be replaced. This will typically include a named group, which can be used by `repl`.
repl:str	An expression specifying what should replace the matched substring. Typically, this will include a named replacement group, specified by `regexp`.
Returns
None	Undocumented
Raises
`ValueError`	If this transformation generated an invalid chunkstring.
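The named-group convention described for `regexp` and `repl` can be sketched with `re.sub` alone. The patterns and group names below (`np`, `verb`) are illustrative assumptions; the same pair of arguments would be what a caller hands to `xform`.

```python
import re

encoded = "<DT><JJ><NN>{<VBD>}<.>"

# Add braces: the named group `np` is echoed back inside a new chunk.
chunk_regexp = r"(?P<np><DT>(?:<JJ>)*<NN>)"
chunk_repl = r"{\g<np>}"
step1 = re.sub(chunk_regexp, chunk_repl, encoded)

# Remove braces: keep the named group `verb`, drop the surrounding chunk.
unchunk_regexp = r"\{(?P<verb><VB[DNZ]?>)\}"
unchunk_repl = r"\g<verb>"
step2 = re.sub(unchunk_regexp, unchunk_repl, step1)

print(step1)
print(step2)
```

Both substitutions leave the tag sequence untouched and only move braces, which is the contract stated above.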

CHUNK_TAG =

Undocumented

Value

'(<%s+?>)' % CHUNK_TAG_CHAR

CHUNK_TAG_CHAR: str =

Undocumented

Value

'[^\\{\\}<>]'
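As a small illustration of what these two values match, the sketch below rebuilds them as local variables (copied from the values shown here, not imported from NLTK) and extracts the tags from an encoding string.

```python
import re

# Local copies of the documented values.
CHUNK_TAG_CHAR = r"[^\{\}<>]"
CHUNK_TAG = "(<%s+?>)" % CHUNK_TAG_CHAR

encoded = "{<DT><JJ><NN>}<VBN><.>"

# With a single capturing group, findall returns the text of each tag.
tags = re.findall(CHUNK_TAG, encoded)
print(tags)
```

Because the character class excludes braces and angle brackets, each match is exactly one angle-bracket delimited tag.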

IN_CHUNK_PATTERN: str =

A zero-width regexp pattern string that will only match positions that are in chunks.

Value

'(?=[^\\{]*\\})'

IN_STRIP_PATTERN: str =

A zero-width regexp pattern string that will only match positions that are in strips.

Value

'(?=[^\\}]*(\\{|$))'
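The two lookahead patterns can be appended to a tag pattern to restrict matches to tags inside or outside chunks. The sketch below uses local copies of the documented values; the tag pattern `TAG` and the sample encoding are assumptions for illustration.

```python
import re

# Local copies of the documented values.
IN_CHUNK_PATTERN = r"(?=[^\{]*\})"
IN_STRIP_PATTERN = r"(?=[^\}]*(\{|$))"

# A single tag, without a capturing group.
TAG = r"<[^\{\}<>]+?>"

encoded = "{<DT><NN>}<VBD>{<NN>}<.>"

# Tags followed by a closing brace before any opening brace: inside a chunk.
in_chunk = re.findall(TAG + IN_CHUNK_PATTERN, encoded)

# IN_STRIP_PATTERN contains a group, so use finditer to get whole matches.
in_strip = [m.group(0) for m in re.finditer(TAG + IN_STRIP_PATTERN, encoded)]

print(in_chunk)
print(in_strip)
```

Both patterns are zero-width, so they filter positions without consuming any of the encoding.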

def _tag(self, tok):

Undocumented

def _verify(self, s, verify_tags):

Check to make sure that s still corresponds to some chunked version of _pieces.

Parameters
s	Undocumented
verify_tags:bool	Whether the individual tags should be checked. If this is false, `_verify` will check to make sure that `_str` encodes a chunked version of some list of tokens. If this is true, then `_verify` will check to make sure that the tags in `_str` match those in `_pieces`.
Raises
`ValueError`	If the internal string representation of this `ChunkString` is invalid or not consistent with `_pieces`.

_BALANCED_BRACKETS =

Undocumented

Value

re.compile(r'(\{\})*$')

_BRACKETS =

Undocumented

Value

re.compile(r'[^\{\}]+')
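One plausible way these two patterns combine is sketched below: delete everything that is not a brace, then require the remainder to be a run of `{}` pairs. The function `braces_are_flat` is a hypothetical helper written for this example, and the real `_verify` method may apply the patterns differently.

```python
import re

# Local copies of the documented values.
BRACKETS = re.compile(r"[^\{\}]+")
BALANCED_BRACKETS = re.compile(r"(\{\})*$")


def braces_are_flat(encoded):
    """True if braces are balanced and never nested."""
    only_braces = BRACKETS.sub("", encoded)  # e.g. "{}{}" or "{{}}"
    return BALANCED_BRACKETS.match(only_braces) is not None


print(braces_are_flat("{<DT><NN>}<VBD>{<NN>}"))  # flat, balanced
print(braces_are_flat("{<DT>{<NN>}}"))           # nested
print(braces_are_flat("{<DT><NN>"))              # unclosed
```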

_CHUNK =

Undocumented

Value

'(\\{%s+?\\})+?' % CHUNK_TAG
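To make the shape of this value concrete, the following sketch expands it with a local copy of `CHUNK_TAG` and lists the chunks it finds in a sample encoding. The variable names and the sample string are assumptions for illustration.

```python
import re

# Local copies of the documented values.
CHUNK_TAG = "(<%s+?>)" % r"[^\{\}<>]"
CHUNK = "(\\{%s+?\\})+?" % CHUNK_TAG

encoded = "{<DT><NN>}<VBD>{<NN>}<.>"

# The pattern has capturing groups, so collect whole matches via finditer.
chunks = [m.group(0) for m in re.finditer(CHUNK, encoded)]
print(chunks)
```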

_STRIP =

Undocumented

Value

'(%s+?)+?' % CHUNK_TAG

_VALID =

Undocumented

Value

re.compile(('^(\\{?%s\\}?)*?$' % CHUNK_TAG))
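The sketch below rebuilds this pattern locally to show what it accepts. Note the assumption being illustrated: the pattern checks only the overall form (tags with optional braces around them), so an unbalanced string can still pass and brace balance needs a separate check.

```python
import re

# Local copies of the documented values.
CHUNK_TAG = "(<%s+?>)" % r"[^\{\}<>]"
VALID = re.compile("^(\\{?%s\\}?)*?$" % CHUNK_TAG)

well_formed = VALID.match("{<DT><NN>}<VBD>") is not None
stray_text = VALID.match("<DT> <NN>") is not None
unbalanced = VALID.match("{<DT>{<NN>}") is not None

print(well_formed)  # tags and braces only
print(stray_text)   # a space between tags breaks the form
print(unbalanced)   # form is fine even though braces do not balance
```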

_debug =

The debug level. See the constructor docs.

_pieces: list(tagged tokens and chunks) =

The tagged tokens and chunks encoded by this ChunkString.

_root_label =

Undocumented

_str: str =

The internal string representation of the text's encoding. This string representation contains a sequence of angle-bracket delimited tags, with chunking indicated by braces. An example of this encoding is:

{<DT><JJ><NN>}<VBN><IN>{<DT><NN>}<.>{<DT><NN>}<VBD><.>