This module supports TGrep2 syntax for matching parts of NLTK Trees. Note that many tgrep operators require the tree passed to be a ParentedTree.
Usage
>>> from nltk.tree import ParentedTree
>>> from nltk.tgrep import tgrep_nodes, tgrep_positions
>>> tree = ParentedTree.fromstring('(S (NP (DT the) (JJ big) (NN dog)) (VP bit) (NP (DT a) (NN cat)))')
>>> list(tgrep_nodes('NN', [tree]))
[[ParentedTree('NN', ['dog']), ParentedTree('NN', ['cat'])]]
>>> list(tgrep_positions('NN', [tree]))
[[(0, 2), (2, 1)]]
>>> list(tgrep_nodes('DT', [tree]))
[[ParentedTree('DT', ['the']), ParentedTree('DT', ['a'])]]
>>> list(tgrep_nodes('DT $ JJ', [tree]))
[[ParentedTree('DT', ['the'])]]
This implementation adds syntax to select nodes based on their NLTK tree position. This syntax is N plus a Python tuple representing the tree position. For instance, N(), N(0,), N(0,0) are valid node selectors. Example:
>>> tree = ParentedTree.fromstring('(S (NP (DT the) (JJ big) (NN dog)) (VP bit) (NP (DT a) (NN cat)))')
>>> tree[0,0]
ParentedTree('DT', ['the'])
>>> tree[0,0].treeposition()
(0, 0)
>>> list(tgrep_nodes('N(0,0)', [tree]))
[[ParentedTree('DT', ['the'])]]
Caveats:
- Link modifiers: "?" and "=" are not implemented.
- Tgrep compatibility: Using "@" for "!", "{" for "<", and "}" for ">" is not implemented.
- The "=" and "~" links are not implemented.
Known Issues:
There are some issues with link relations involving leaf nodes (which are represented as bare strings in NLTK trees). For instance, consider the tree:
(S (A x))
The search string * !>> S should select all nodes which are not dominated in some way by an S node (i.e., all nodes which are not descendants of an S). Clearly, in this tree, the only node which fulfills this criterion is the top node (since it is not dominated by anything). However, the code here will find both the top node and the leaf node x. This is because we cannot recover the parent of the leaf, since it is stored as a bare string.
A possible workaround, when performing this kind of search, would be to filter out all leaf nodes.
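The workaround can be sketched as a small post-filter. This is a minimal sketch, assuming the matches have the shape shown in the Usage examples (one list of matches per tree) and that leaves are bare strings. The helper name drop_leaf_matches is hypothetical and not part of this module, and the data below is a stand-in rather than real search output.

```python
def drop_leaf_matches(matches_per_tree):
    """Yield each tree's matches with bare-string leaves removed.

    Hypothetical helper: assumes one list of matches per tree, and that
    leaf nodes are plain ``str`` objects while internal nodes are not.
    """
    for matches in matches_per_tree:
        yield [node for node in matches if not isinstance(node, str)]


# Stand-in data: a tuple plays the role of an internal node, 'x' is a leaf.
fake_matches = [[("S", "A"), "x"]]
filtered = list(drop_leaf_matches(fake_matches))
```

With real results, the same filter would presumably wrap the search call, for example list(drop_leaf_matches(tgrep_nodes('* !>> S', [tree]))).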
Implementation notes
This implementation is (somewhat awkwardly) based on lambda functions which are predicates on a node. A predicate is a function which returns either True or False; using a predicate function, we can identify sets of nodes with particular properties. A predicate function could, for instance, return True only if a particular node has a label matching a particular regular expression and has a daughter node which has no sisters. Because tgrep2 search strings can do things statefully (such as substituting in macros, and binding nodes with node labels), the actual predicate function is declared with three arguments:
pred = lambda n, m=None, l=None: True  # some logic here
- n: a node in a tree; this argument must always be given
- m: a dictionary mapping macro names onto predicate functions
- l: a dictionary mapping node labels onto nodes in the tree
m and l are declared to default to None, and so need not be specified in a call to a predicate. Predicates which call other predicates must always pass the value of these arguments on. The top-level predicate (constructed by _tgrep_exprs_action) binds the macro definitions to m and initialises l to an empty dictionary.
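The way state is threaded through the predicates can be illustrated with a toy version. This is a minimal sketch under stated assumptions, not the module's code: nodes are plain dicts, and label_pred, macro_use, bind_label and top_level are hypothetical names chosen for illustration. It shows inner predicates passing m and l on unchanged, while the top-level predicate supplies the macros and a fresh label dictionary.

```python
def label_pred(label):
    """Predicate: does the node's label equal `label`?"""
    return lambda n, m=None, l=None: n["label"] == label


def macro_use(name):
    """Predicate that looks `name` up in the macro dictionary `m` at call time."""
    return lambda n, m=None, l=None: m[name](n, m, l)


def bind_label(pred, name):
    """Wrap `pred` so that a matching node is recorded in `l` under `name`."""
    def bound(n, m=None, l=None):
        if pred(n, m, l):
            l[name] = n
            return True
        return False
    return bound


def top_level(pred, macros):
    """Top-level predicate: supplies the macros and a fresh label dictionary."""
    return lambda n, m=None, l=None: pred(n, macros, {})


# Toy nodes are plain dicts here, purely for illustration.
np_node = {"label": "NP"}
vp_node = {"label": "VP"}

macros = {"NOUNPHRASE": label_pred("NP")}
query = top_level(bind_label(macro_use("NOUNPHRASE"), "n"), macros)
results = [query(np_node), query(vp_node)]
```

Callers of query only ever pass the node, which is why the defaults of None are harmless: nothing below the top level runs without m and l having been filled in.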
Exception | | Tgrep exception type. |
Function | ancestors | Returns the list of all nodes dominating the given tree node. This method will not work with leaf nodes, since there is no way to recover the parent. |
Function | tgrep | Parses (and tokenizes, if necessary) a TGrep search string into a lambda function. |
Function | tgrep | Return the tree nodes in the trees which match the given pattern. |
Function | tgrep | Return the tree positions in the trees which match the given pattern. |
Function | tgrep | Tokenizes a TGrep search string into separate tokens. |
Function | treepositions | Returns all the tree positions in the given tree which are not leaf nodes. |
Function | unique | Returns the list of all nodes dominating the given node, where there is only a single path of descent. |
Function | _after | Returns the set of all nodes that are after the given node. |
Function | _before | Returns the set of all nodes that are before the given node. |
Function | _build | Builds a pyparsing-based parser object for tokenizing and interpreting tgrep search strings. |
Function | _descendants | Returns the list of all nodes which are descended from the given tree node in some way. |
Function | _immediately | Returns the set of all nodes that are immediately after the given node. |
Function | _immediately | Returns the set of all nodes that are immediately before the given node. |
Function | _istree | Predicate to check whether obj is a nltk.tree.Tree. |
Function | _leftmost | Returns the set of all nodes descended in some way through left branches from this node. |
Function | _macro | Builds a dictionary structure which defines the given macro. |
Function | _rightmost | Returns the set of all nodes descended in some way through right branches from this node. |
Function | _tgrep | Builds a lambda function representing a predicate on a tree node which can optionally bind a matching node into the tgrep2 string's label_dict. |
Function | _tgrep | Builds a lambda function representing a predicate on a tree node from the conjunction of several other such lambda functions. |
Function | _tgrep | This is the top-level node in a tgrep2 search string; the predicate function it returns binds together all the state of a tgrep2 search string. |
Function | _tgrep | Builds a lambda function which looks up the macro name used. |
Function | _tgrep | Builds a lambda function representing a predicate on a tree node which returns true if the node is located at a specific tree position. |
Function | _tgrep | Builds a lambda function representing a predicate on a tree node depending on the name of its node. |
Function | _tgrep | Builds a lambda function representing a predicate on a tree node which describes the use of a previously bound node label. |
Function | _tgrep | Returns the node label used to begin a tgrep_expr_labeled. See _tgrep_segmented_pattern_action. |
Function | _tgrep | Gets the string value of a given parse tree node, for comparison using the tgrep node literal predicates. |
Function | _tgrep | Builds a lambda function representing a predicate on a tree node from a parenthetical notation. |
Function | _tgrep | Builds a lambda function representing a predicate on a tree node from the disjunction of several other such lambda functions. |
Function | _tgrep | Builds a lambda function representing a predicate on a tree node depending on its relation to other nodes in the tree. |
Function | _tgrep | Builds a lambda function representing a segmented pattern. |
Function | _unique | Returns the list of all nodes descended from the given node, where there is only a single path of descent. |
Returns the list of all nodes dominating the given tree node. This method will not work with leaf nodes, since there is no way to recover the parent.
Return the tree nodes in the trees which match the given pattern.
Parameters | |
pattern:str or output of tgrep_compile() | a tgrep search pattern |
trees:iter(ParentedTree) or iter(Tree) | a sequence of NLTK trees (usually ParentedTrees) |
search | whether to return matching leaf nodes |
Returns | |
iter(tree nodes) | Undocumented |
Return the tree positions in the trees which match the given pattern.
Parameters | |
pattern:str or output of tgrep_compile() | a tgrep search pattern |
trees:iter(ParentedTree) or iter(Tree) | a sequence of NLTK trees (usually ParentedTrees) |
search | whether to return matching leaf nodes |
Returns | |
iter(tree positions) | Undocumented |
Returns the list of all nodes dominating the given node, where there is only a single path of descent.
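One reading of "only a single path of descent" is that the walk upward continues only while each ancestor has exactly one child. The sketch below illustrates that reading on a toy tree; the Node class and unique_ancestors_sketch are hypothetical stand-ins and are not the module's implementation.

```python
class Node:
    """Toy tree node with a parent pointer (a stand-in for a ParentedTree)."""

    def __init__(self, label, children=()):
        self.label = label
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self


def unique_ancestors_sketch(node):
    """Collect ancestors upward for as long as each one has a single child."""
    found = []
    current = node.parent
    while current is not None and len(current.children) == 1:
        found.append(current)
        current = current.parent
    return found


# (S (A (B (C))) (D)): C's chain of only-child ancestors is B, then A.
c = Node("C")
b = Node("B", [c])
a = Node("A", [b])
d = Node("D")
s = Node("S", [a, d])
labels = [n.label for n in unique_ancestors_sketch(c)]
```

The walk stops at S because S has two children, so S is not reached by a single path of descent under this reading.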
Returns the set of all nodes that are immediately after the given node.
Tree node A immediately follows node B if the first terminal symbol (word) produced by A immediately follows the last terminal symbol produced by B.
Returns the set of all nodes that are immediately before the given node.
Tree node A immediately precedes node B if the last terminal symbol (word) produced by A immediately precedes the first terminal symbol produced by B.
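The two definitions above can be checked mechanically by recording, for every internal node, the index of its first and last terminal. The following is a minimal sketch on a toy representation (nested tuples with bare-string leaves); leaf_spans and immediately_follows are hypothetical names, and the module itself may compute this differently.

```python
def leaf_spans(tree, start=0, path=(), spans=None):
    """Map each internal node's position to its (first, last) leaf index.

    Toy representation: an internal node is a tuple (label, child, ...),
    and a leaf is a bare string.
    """
    if spans is None:
        spans = {}
    index = start
    for i, child in enumerate(tree[1:]):
        if isinstance(child, str):
            index += 1  # a leaf occupies exactly one terminal slot
        else:
            leaf_spans(child, index, path + (i,), spans)
            index = spans[path + (i,)][1] + 1
    spans[path] = (start, index - 1)
    return spans


def immediately_follows(spans, a, b):
    """True if a's first leaf comes directly after b's last leaf."""
    return spans[a][0] == spans[b][1] + 1


tree = ("S",
        ("NP", ("DT", "the"), ("NN", "dog")),
        ("VP", ("VB", "bit")))
spans = leaf_spans(tree)
vp_after_np = immediately_follows(spans, (1,), (0,))
vp_after_dt = immediately_follows(spans, (1,), (0, 0))
```

"Immediately precedes" is the mirror image: A immediately precedes B exactly when immediately_follows(spans, B, A) holds.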
Builds a lambda function representing a predicate on a tree node which can optionally bind a matching node into the tgrep2 string's label_dict.
Called for expressions like (tgrep_node_expr2):
/NP/
@NP=n
Builds a lambda function representing a predicate on a tree node from the conjunction of several other such lambda functions.
This is prototypically called for expressions like (tgrep_rel_conjunction):
< NP & < AP < VP
where tokens is a list of predicates representing the relations (< NP, < AP, and < VP), possibly with the character & included (as in the example here).
This is also called for expressions like (tgrep_node_expr2):
NP < NN
S=s < /NP/=n : s < /VP/=v : n .. v
tokens[0] is a tgrep_expr predicate; tokens[1:] are an (optional) list of segmented patterns (tgrep_expr_labeled, processed by _tgrep_segmented_pattern_action).
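The conjunction of relation predicates described here can be sketched in a few lines. This is an illustrative sketch only, assuming the token list holds callables plus optional literal "&" strings; conjunction and has_child are hypothetical names, and nodes are toy dicts rather than NLTK trees.

```python
def conjunction(tokens):
    """Combine relation predicates into one that requires all of them.

    Any literal "&" separators in `tokens` are skipped, so "< NP & < AP"
    and "< NP < AP" combine the same way.
    """
    preds = [tok for tok in tokens if tok != "&"]
    return lambda n, m=None, l=None: all(p(n, m, l) for p in preds)


def has_child(label):
    """Hypothetical stand-in for a '< label' relation predicate."""
    return lambda n, m=None, l=None: label in n["children"]


node = {"label": "S", "children": ["NP", "AP", "VP"]}
both = conjunction([has_child("NP"), "&", has_child("AP"), has_child("VP")])
missing = conjunction([has_child("NP"), "&", has_child("PP")])
outcome = (both(node), missing(node))
```

Passing m and l through to every inner predicate keeps the macro and label state visible to each relation in the conjunction.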
This is the top-level node in a tgrep2 search string; the predicate function it returns binds together all the state of a tgrep2 search string.
Builds a lambda function representing a predicate on a tree node from the disjunction of several tgrep expressions. Also handles macro definitions and macro name binding, and node label definitions and node label binding.
Builds a lambda function representing a predicate on a tree node which returns true if the node is located at a specific tree position.
Builds a lambda function representing a predicate on a tree node which describes the use of a previously bound node label.
Called for expressions like (tgrep_node_label_use_pred):
=s
when they appear inside a tgrep_node_expr (for example, inside a relation). The predicate returns true if and only if its node argument is identical to the node looked up in the node label dictionary using the node's label.
Returns the node label used to begin a tgrep_expr_labeled. See _tgrep_segmented_pattern_action.
Called for expressions like (tgrep_node_label_use):
=s
when they appear as the first element of a tgrep_expr_labeled expression (see _tgrep_segmented_pattern_action).
It returns the node label.
Gets the string value of a given parse tree node, for comparison using the tgrep node literal predicates.
Builds a lambda function representing a predicate on a tree node from the disjunction of several other such lambda functions.
Builds a lambda function representing a predicate on a tree node depending on its relation to other nodes in the tree.
Builds a lambda function representing a segmented pattern.
Called for expressions like (tgrep_expr_labeled):
=s .. =v < =n
This is a segmented pattern, a tgrep2 expression which begins with a node label.
The problem is that for segmented_pattern_action (': =v < =s'), the first element (in this case, =v) is specifically selected by virtue of matching a particular node in the tree; to retrieve the node, we need the label, not a lambda function. For node labels inside a tgrep_node_expr, we need a lambda function which returns true if the node visited is the same as =v.
We solve this by creating two copies of a node_label_use in the grammar; the label use inside a tgrep_expr_labeled has a separate parse action to the pred use inside a node_expr. See _tgrep_node_label_use_action and _tgrep_node_label_pred_use_action.