module documentation

This module supports TGrep2 syntax for matching parts of NLTK Trees. Note that many tgrep operators require the tree passed to be a ParentedTree.

Usage

>>> from nltk.tree import ParentedTree
>>> from nltk.tgrep import tgrep_nodes, tgrep_positions
>>> tree = ParentedTree.fromstring('(S (NP (DT the) (JJ big) (NN dog)) (VP bit) (NP (DT a) (NN cat)))')
>>> list(tgrep_nodes('NN', [tree]))
[[ParentedTree('NN', ['dog']), ParentedTree('NN', ['cat'])]]
>>> list(tgrep_positions('NN', [tree]))
[[(0, 2), (2, 1)]]
>>> list(tgrep_nodes('DT', [tree]))
[[ParentedTree('DT', ['the']), ParentedTree('DT', ['a'])]]
>>> list(tgrep_nodes('DT $ JJ', [tree]))
[[ParentedTree('DT', ['the'])]]

This implementation adds syntax to select nodes based on their NLTK tree position. This syntax is N plus a Python tuple representing the tree position. For instance, N(), N(0,), N(0,0) are valid node selectors. Example:

>>> tree = ParentedTree.fromstring('(S (NP (DT the) (JJ big) (NN dog)) (VP bit) (NP (DT a) (NN cat)))')
>>> tree[0,0]
ParentedTree('DT', ['the'])
>>> tree[0,0].treeposition()
(0, 0)
>>> list(tgrep_nodes('N(0,0)', [tree]))
[[ParentedTree('DT', ['the'])]]

Caveats:

  • Link modifiers: "?" and "=" are not implemented.
  • Tgrep compatibility: using "@" for "!", "{" for "<", and "}" for ">" is not implemented.
  • The "=" and "~" links are not implemented.

Known Issues:

  • There are some issues with link relations involving leaf nodes (which are represented as bare strings in NLTK trees). For instance, consider the tree:

    (S (A x))
    

    The search string * !>> S should select all nodes which are not dominated in some way by an S node (i.e., all nodes which are not descendants of an S). Clearly, in this tree, the only node which fulfills this criterion is the top node (since it is not dominated by anything). However, the code here will find both the top node and the leaf node x. This is because we cannot recover the parent of the leaf, since it is stored as a bare string.

    A possible workaround, when performing this kind of search, would be to filter out all leaf nodes.
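The leaf problem and the suggested workaround can be illustrated without NLTK. The sketch below is a stand-in, not this module's code: Node, walk, and not_dominated_by are hypothetical names, and leaves are stored as bare strings just as in the tree above.

```python
class Node:
    """Hypothetical minimal tree node; leaves are stored as bare strings."""

    def __init__(self, label, children):
        self.label = label
        self.children = children
        self.parent = None
        for child in children:
            if isinstance(child, Node):
                child.parent = self  # a str leaf cannot carry a parent link


def walk(node):
    """Yield the node itself and everything below it, leaves included."""
    yield node
    if isinstance(node, Node):
        for child in node.children:
            yield from walk(child)


def not_dominated_by(label):
    """Build a predicate that is true for nodes with no ancestor named `label`."""
    def pred(n):
        current = getattr(n, "parent", None)  # a str leaf has no parent attribute
        while current is not None:
            if current.label == label:
                return False
            current = current.parent
        return True
    return pred


tree = Node("S", [Node("A", ["x"])])  # corresponds to (S (A x))
pred = not_dominated_by("S")

# Without filtering, the leaf slips through because its parent is unknown.
naive = [n for n in walk(tree) if pred(n)]
# The workaround: drop bare-string leaves before applying the predicate.
filtered = [n for n in walk(tree) if isinstance(n, Node) and pred(n)]

print([getattr(n, "label", n) for n in naive])     # ['S', 'x']
print([getattr(n, "label", n) for n in filtered])  # ['S']
```

In this module, tgrep_nodes and tgrep_positions take a search_leaves parameter that controls whether matching leaf nodes are returned, which may be a convenient way to apply the same filter.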

Implementation notes

This implementation is (somewhat awkwardly) based on lambda functions which are predicates on a node. A predicate is a function which returns either True or False; using a predicate function, we can identify sets of nodes with particular properties. A predicate function could, for instance, return True only if a particular node has a label matching a particular regular expression and has a daughter node which has no sisters. Because tgrep2 search strings can do things statefully (such as substituting in macros, and binding nodes with node labels), the actual predicate function is declared with three arguments:

pred = lambda n, m=None, l=None: True  # some logic here

  • n is a node in a tree; this argument must always be given
  • m is a dictionary mapping macro names onto predicate functions
  • l is a dictionary mapping node labels onto nodes in the tree

m and l are declared to default to None, and so need not be specified in a call to a predicate. Predicates which call other predicates must always pass the value of these arguments on. The top-level predicate (constructed by _tgrep_exprs_action) binds the macro definitions to m and initialises l to an empty dictionary.
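The calling convention described above can be sketched as follows. This is an illustration only, not the module's actual code: the factory names (label_is, has_child, use_macro, top_level) are hypothetical, and nodes are plain dictionaries rather than NLTK trees.

```python
def label_is(name):
    """Predicate factory: true when the node's label equals `name`."""
    return lambda n, m=None, l=None: n["label"] == name


def has_child(child_pred):
    """Predicate factory: true when some child satisfies `child_pred`.

    m and l are passed on unchanged, as nested predicates require.
    """
    return lambda n, m=None, l=None: any(
        child_pred(c, m, l) for c in n["children"]
    )


def use_macro(macro_name):
    """Predicate factory: look the macro up in m when the predicate runs."""
    return lambda n, m=None, l=None: m[macro_name](n, m, l)


def top_level(pred, macros):
    """Bind the macro table to m and start with an empty label dictionary."""
    return lambda n, m=None, l=None: pred(n, macros, {})


# Nodes are plain dicts here purely for illustration.
dt = {"label": "DT", "children": []}
np_node = {"label": "NP", "children": [dt]}

macros = {"DET": label_is("DT")}
np_with_det = top_level(has_child(use_macro("DET")), macros)

print(np_with_det(np_node))  # True: NP has a DT child
print(np_with_det(dt))       # False: DT has no children
```

Only the top-level predicate supplies m and l; every inner predicate simply forwards whatever it received, so callers of the top-level predicate need to pass only the node.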

Exception TgrepException Tgrep exception type.
Function ancestors Returns the list of all nodes dominating the given tree node. This method will not work with leaf nodes, since there is no way to recover the parent.
Function tgrep_compile Parses (and tokenizes, if necessary) a TGrep search string into a lambda function.
Function tgrep_nodes Return the tree nodes in the trees which match the given pattern.
Function tgrep_positions Return the tree positions in the trees which match the given pattern.
Function tgrep_tokenize Tokenizes a TGrep search string into separate tokens.
Function treepositions_no_leaves Returns all the tree positions in the given tree which are not leaf nodes.
Function unique_ancestors Returns the list of all nodes dominating the given node, where there is only a single path of descent.
Function _after Returns the set of all nodes that are after the given node.
Function _before Returns the set of all nodes that are before the given node.
Function _build_tgrep_parser Builds a pyparsing-based parser object for tokenizing and interpreting tgrep search strings.
Function _descendants Returns the list of all nodes which are descended from the given tree node in some way.
Function _immediately_after Returns the set of all nodes that are immediately after the given node.
Function _immediately_before Returns the set of all nodes that are immediately before the given node.
Function _istree Predicate to check whether obj is a nltk.tree.Tree.
Function _leftmost_descendants Returns the set of all nodes descended in some way through left branches from this node.
Function _macro_defn_action Builds a dictionary structure which defines the given macro.
Function _rightmost_descendants Returns the set of all nodes descended in some way through right branches from this node.
Function _tgrep_bind_node_label_action Builds a lambda function representing a predicate on a tree node which can optionally bind a matching node into the tgrep2 string's label_dict.
Function _tgrep_conjunction_action Builds a lambda function representing a predicate on a tree node from the conjunction of several other such lambda functions.
Function _tgrep_exprs_action This is the top-level node in a tgrep2 search string; the predicate function it returns binds together all the state of a tgrep2 search string.
Function _tgrep_macro_use_action Builds a lambda function which looks up the macro name used.
Function _tgrep_nltk_tree_pos_action Builds a lambda function representing a predicate on a tree node which returns true if the node is located at a specific tree position.
Function _tgrep_node_action Builds a lambda function representing a predicate on a tree node depending on the name of its node.
Function _tgrep_node_label_pred_use_action Builds a lambda function representing a predicate on a tree node which describes the use of a previously bound node label.
Function _tgrep_node_label_use_action Returns the node label used to begin a tgrep_expr_labeled. See _tgrep_segmented_pattern_action.
Function _tgrep_node_literal_value Gets the string value of a given parse tree node, for comparison using the tgrep node literal predicates.
Function _tgrep_parens_action Builds a lambda function representing a predicate on a tree node from a parenthetical notation.
Function _tgrep_rel_disjunction_action Builds a lambda function representing a predicate on a tree node from the disjunction of several other such lambda functions.
Function _tgrep_relation_action Builds a lambda function representing a predicate on a tree node depending on its relation to other nodes in the tree.
Function _tgrep_segmented_pattern_action Builds a lambda function representing a segmented pattern.
Function _unique_descendants Returns the list of all nodes descended from the given node, where there is only a single path of descent.

def ancestors(node):

Returns the list of all nodes dominating the given tree node. This method will not work with leaf nodes, since there is no way to recover the parent.
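A minimal sketch of the idea, assuming each non-leaf node exposes a parent() method. The Node class and ancestors_sketch are hypothetical stand-ins, not the module's implementation, and the nearest-first ordering is a choice made for this sketch.

```python
class Node:
    """Hypothetical stand-in for a parented tree node."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)
        self._parent = None
        for child in self.children:
            if isinstance(child, Node):
                child._parent = self

    def parent(self):
        return self._parent


def ancestors_sketch(node):
    """Collect every node that dominates `node`, nearest ancestor first."""
    results = []
    try:
        current = node.parent()
    except AttributeError:
        return results  # e.g. a bare-string leaf: no parent to recover
    while current is not None:
        results.append(current)
        current = current.parent()
    return results


dt = Node("DT", ["the"])
np_node = Node("NP", [dt])
s = Node("S", [np_node])

print([a.label for a in ancestors_sketch(dt)])  # ['NP', 'S']
print(ancestors_sketch("the"))                  # []
```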

def tgrep_compile(tgrep_string):

Parses (and tokenizes, if necessary) a TGrep search string into a lambda function.

def tgrep_nodes(pattern, trees, search_leaves=True):

Return the tree nodes in the trees which match the given pattern.

Parameters
pattern (str or output of tgrep_compile()): a tgrep search pattern
trees (iter(ParentedTree) or iter(Tree)): a sequence of NLTK trees (usually ParentedTrees)
search_leaves (bool): whether to return matching leaf nodes
Returns
iter(tree nodes): Undocumented

def tgrep_positions(pattern, trees, search_leaves=True):

Return the tree positions in the trees which match the given pattern.

Parameters
pattern (str or output of tgrep_compile()): a tgrep search pattern
trees (iter(ParentedTree) or iter(Tree)): a sequence of NLTK trees (usually ParentedTrees)
search_leaves (bool): whether to return matching leaf nodes
Returns
iter(tree positions): Undocumented

def tgrep_tokenize(tgrep_string):

Tokenizes a TGrep search string into separate tokens.

def treepositions_no_leaves(tree):

Returns all the tree positions in the given tree which are not leaf nodes.
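To make "tree positions which are not leaf nodes" concrete, here is a small sketch over a hypothetical nested representation: a tree is a (label, children) pair and a leaf is a bare string. The function name positions_no_leaves and the preorder ordering are assumptions of this sketch, not a description of the module's code.

```python
def positions_no_leaves(tree, prefix=()):
    """Return the positions of all non-leaf nodes, in preorder.

    A tree is a (label, children) pair; a leaf is a bare string.
    A position is the tuple of child indexes leading from the root.
    """
    if isinstance(tree, str):
        return []  # leaves contribute no position
    _label, children = tree
    positions = [prefix]
    for index, child in enumerate(children):
        positions.extend(positions_no_leaves(child, prefix + (index,)))
    return positions


tree = ("S", [("NP", [("DT", ["the"]), ("NN", ["dog"])]), ("VP", ["barked"])])
print(positions_no_leaves(tree))
# [(), (0,), (0, 0), (0, 1), (1,)]
```

The leaf positions (0, 0, 0), (0, 1, 0), and (1, 0) are absent from the result.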

def unique_ancestors(node):

Returns the list of all nodes dominating the given node, where there is only a single path of descent.
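A sketch of one reading of this description, assuming "a single path of descent" means that each ancestor in the chain has exactly one child. Node and unique_ancestors_sketch are hypothetical names used only for illustration.

```python
class Node:
    """Hypothetical stand-in for a parented tree node."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)
        self._parent = None
        for child in self.children:
            if isinstance(child, Node):
                child._parent = self

    def parent(self):
        return self._parent


def unique_ancestors_sketch(node):
    """Walk upward while each ancestor has exactly one child."""
    results = []
    current = node.parent()
    while current is not None and len(current.children) == 1:
        results.append(current)
        current = current.parent()
    return results


nn = Node("NN", ["dog"])
np_node = Node("NP", [nn])                        # NP has a single child
s = Node("S", [np_node, Node("VP", ["barked"])])  # S branches, so the walk stops

print([a.label for a in unique_ancestors_sketch(nn)])       # ['NP']
print([a.label for a in unique_ancestors_sketch(np_node)])  # []
```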

def _after(node):

Returns the set of all nodes that are after the given node.

def _before(node):

Returns the set of all nodes that are before the given node.

def _build_tgrep_parser(set_parse_actions=True):

Builds a pyparsing-based parser object for tokenizing and interpreting tgrep search strings.

def _descendants(node):

Returns the list of all nodes which are descended from the given tree node in some way.

def _immediately_after(node):

Returns the set of all nodes that are immediately after the given node.

Tree node A immediately follows node B if the first terminal symbol (word) produced by A immediately follows the last terminal symbol produced by B.

def _immediately_before(node):

Returns the set of all nodes that are immediately before the given node.

Tree node A immediately precedes node B if the last terminal symbol (word) produced by A immediately precedes the first terminal symbol produced by B.
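The definitions of "immediately follows" and "immediately precedes" can be checked with leaf spans. The sketch below is not the module's implementation; it assumes a hypothetical nested representation, where a tree is a (label, children) pair and a leaf is a bare string, and the names leaf_spans and immediately_precedes are made up for illustration.

```python
def leaf_spans(tree, start=0, position=(), spans=None):
    """Map each position to the half-open range of leaf indexes it covers.

    Returns the mapping together with the index just past the last leaf seen.
    """
    if spans is None:
        spans = {}
    if isinstance(tree, str):
        spans[position] = (start, start + 1)
        return spans, start + 1
    _label, children = tree
    end = start
    for index, child in enumerate(children):
        _, end = leaf_spans(child, end, position + (index,), spans)
    spans[position] = (start, end)
    return spans, end


def immediately_precedes(spans, a, b):
    """True if the last leaf under a is directly followed by the first leaf under b."""
    return spans[a][1] == spans[b][0]


tree = ("S", [("NP", [("DT", ["the"]), ("NN", ["dog"])]), ("VP", ["barked"])])
spans, _ = leaf_spans(tree)

print(immediately_precedes(spans, (0, 0), (0, 1)))  # True: DT, then NN
print(immediately_precedes(spans, (0,), (1,)))      # True: NP ends where VP begins
print(immediately_precedes(spans, (0, 0), (1,)))    # False: 'dog' intervenes
```

"A immediately follows B" is the same test with the arguments swapped.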

def _istree(obj):

Predicate to check whether obj is a nltk.tree.Tree.

def _leftmost_descendants(node):

Returns the set of all nodes descended in some way through left branches from this node.

def _macro_defn_action(_s, _l, tokens):

Builds a dictionary structure which defines the given macro.

def _rightmost_descendants(node):

Returns the set of all nodes descended in some way through right branches from this node.

def _tgrep_bind_node_label_action(_s, _l, tokens):

Builds a lambda function representing a predicate on a tree node which can optionally bind a matching node into the tgrep2 string's label_dict.

Called for expressions like (tgrep_node_expr2):

/NP/
@NP=n

def _tgrep_conjunction_action(_s, _l, tokens, join_char='&'):

Builds a lambda function representing a predicate on a tree node from the conjunction of several other such lambda functions.

This is prototypically called for expressions like (tgrep_rel_conjunction):

< NP & < AP < VP

where tokens is a list of predicates representing the relations (< NP, < AP, and < VP), possibly with the character & included (as in the example here).

This is also called for expressions like (tgrep_node_expr2):

NP < NN
S=s < /NP/=n : s < /VP/=v : n .. v

tokens[0] is a tgrep_expr predicate; tokens[1:] are an (optional) list of segmented patterns (tgrep_expr_labeled, processed by _tgrep_segmented_pattern_action).
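The conjunction of relation predicates can be sketched as below, assuming the n, m, l calling convention from the implementation notes. conjunction_sketch, has_child_labeled, and leaf_node are hypothetical names, nodes are plain dictionaries, and the handling of segmented patterns is left out.

```python
def conjunction_sketch(tokens, join_char="&"):
    """Combine relation predicates so that all of them must hold."""
    # The join character may appear among the tokens; it carries no logic.
    preds = [t for t in tokens if t != join_char]
    return lambda n, m=None, l=None: all(p(n, m, l) for p in preds)


def has_child_labeled(name):
    """Hypothetical stand-in for a relation predicate such as '< NP'."""
    return lambda n, m=None, l=None: any(
        c["label"] == name for c in n["children"]
    )


def leaf_node(label):
    """Build a childless dict node for the demonstration."""
    return {"label": label, "children": []}


# Mirrors '< NP & < AP < VP': the '&' is optional between relations.
tokens = [
    has_child_labeled("NP"),
    "&",
    has_child_labeled("AP"),
    has_child_labeled("VP"),
]
pred = conjunction_sketch(tokens)

full = {"label": "S", "children": [leaf_node("NP"), leaf_node("AP"), leaf_node("VP")]}
partial = {"label": "S", "children": [leaf_node("NP"), leaf_node("VP")]}

print(pred(full))     # True: NP, AP and VP children are all present
print(pred(partial))  # False: there is no AP child
```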

def _tgrep_exprs_action(_s, _l, tokens):

This is the top-level node in a tgrep2 search string; the predicate function it returns binds together all the state of a tgrep2 search string.

Builds a lambda function representing a predicate on a tree node from the disjunction of several tgrep expressions. Also handles macro definitions and macro name binding, and node label definitions and node label binding.

def _tgrep_macro_use_action(_s, _l, tokens):

Builds a lambda function which looks up the macro name used.

def _tgrep_nltk_tree_pos_action(_s, _l, tokens):

Builds a lambda function representing a predicate on a tree node which returns true if the node is located at a specific tree position.
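A rough sketch of such a position predicate, working directly from a selector string like N(0,0) rather than from parser tokens. tree_pos_predicate and FakeNode are hypothetical; the only assumption about real nodes is that they expose a treeposition() method, as ParentedTree does in the usage example.

```python
import re


def tree_pos_predicate(selector):
    """Build a predicate from a selector such as 'N(0,0)', 'N(0,)' or 'N()'."""
    if not (selector.startswith("N(") and selector.endswith(")")):
        raise ValueError(f"not a tree position selector: {selector!r}")
    target = tuple(int(d) for d in re.findall(r"\d+", selector))
    # Bare-string leaves have no treeposition(), so they never match.
    return lambda n, m=None, l=None: (
        hasattr(n, "treeposition") and n.treeposition() == target
    )


class FakeNode:
    """Hypothetical node that only knows its own position."""

    def __init__(self, position):
        self._position = position

    def treeposition(self):
        return self._position


at_0_0 = tree_pos_predicate("N(0,0)")
at_root = tree_pos_predicate("N()")

print(at_0_0(FakeNode((0, 0))))  # True
print(at_0_0(FakeNode((0, 1))))  # False
print(at_root(FakeNode(())))     # True
print(at_0_0("the"))             # False: a bare string has no position
```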

def _tgrep_node_action(_s, _l, tokens):

Builds a lambda function representing a predicate on a tree node depending on the name of its node.

def _tgrep_node_label_pred_use_action(_s, _l, tokens):

Builds a lambda function representing a predicate on a tree node which describes the use of a previously bound node label.

Called for expressions like (tgrep_node_label_use_pred):

=s

when they appear inside a tgrep_node_expr (for example, inside a relation). The predicate returns true if and only if its node argument is identical to the node looked up in the node label dictionary using the node's label.

def _tgrep_node_label_use_action(_s, _l, tokens):

Returns the node label used to begin a tgrep_expr_labeled. See _tgrep_segmented_pattern_action.

Called for expressions like (tgrep_node_label_use):

=s

when they appear as the first element of a tgrep_expr_labeled expression (see _tgrep_segmented_pattern_action).

It returns the node label.

def _tgrep_node_literal_value(node):

Gets the string value of a given parse tree node, for comparison using the tgrep node literal predicates.

def _tgrep_parens_action(_s, _l, tokens):

Builds a lambda function representing a predicate on a tree node from a parenthetical notation.

def _tgrep_rel_disjunction_action(_s, _l, tokens):

Builds a lambda function representing a predicate on a tree node from the disjunction of several other such lambda functions.

def _tgrep_relation_action(_s, _l, tokens):

Builds a lambda function representing a predicate on a tree node depending on its relation to other nodes in the tree.

def _tgrep_segmented_pattern_action(_s, _l, tokens):

Builds a lambda function representing a segmented pattern.

Called for expressions like (tgrep_expr_labeled):

=s .. =v < =n

This is a segmented pattern, a tgrep2 expression which begins with a node label.

The problem is that for a segmented pattern (': =v < =s'), the first element (in this case, =v) is specifically selected by virtue of matching a particular node in the tree; to retrieve the node, we need the label, not a lambda function. For node labels inside a tgrep_node_expr, we need a lambda function which returns true if the node visited is the same as =v.

We solve this by creating two copies of a node_label_use in the grammar; the label use inside a tgrep_expr_labeled has a separate parse action from the pred use inside a node_expr. See _tgrep_node_label_use_action and _tgrep_node_label_pred_use_action.

def _unique_descendants(node):

Returns the list of all nodes descended from the given node, where there is only a single path of descent.