nltk.grammar

module documentation

(source)

Basic data classes for representing context free grammars. A "grammar" specifies which trees can represent the structure of a given text. Each of these trees is called a "parse tree" for the text (or simply a "parse"). In a "context free" grammar, the set of parse trees for any piece of a text can depend only on that piece, and not on the rest of the text (i.e., the piece's context). Context free grammars are often used to find possible syntactic structures for sentences. In this context, the leaves of a parse tree are word tokens; and the node values are phrasal categories, such as NP and VP.

The CFG class is used to encode context free grammars. Each CFG consists of a start symbol and a set of productions. The "start symbol" specifies the root node value for parse trees. For example, the start symbol for syntactic parsing is usually S. Start symbols are encoded using the Nonterminal class, which is discussed below.

A Grammar's "productions" specify what parent-child relationships a parse tree can contain. Each production specifies that a particular node can be the parent of a particular set of children. For example, the production <S> -> <NP> <VP> specifies that an S node can be the parent of an NP node and a VP node.

Grammar productions are implemented by the Production class. Each Production consists of a left hand side and a right hand side. The "left hand side" is a Nonterminal that specifies the node type for a potential parent; and the "right hand side" is a list that specifies allowable children for that parent. This lists consists of Nonterminals and text types: each Nonterminal indicates that the corresponding child may be a TreeToken with the specified node type; and each text type indicates that the corresponding child may be a Token with the with that type.

The Nonterminal class is used to distinguish node values from leaf values. This prevents the grammar from accidentally using a leaf value (such as the English word "A") as the node of a subtree. Within a CFG, all node values are wrapped in the Nonterminal class. Note, however, that the trees that are specified by the grammar do not include these Nonterminal wrappers.

Grammars can also be given a more procedural interpretation. According to this interpretation, a Grammar specifies any tree structure tree that can be produced by the following procedure:

Set tree to the start symbol

Repeat until tree contains no more nonterminal leaves:

Choose a production prod with whose left hand side

lhs is a nonterminal leaf of tree.

Replace the nonterminal leaf with a subtree, whose node

value is the value wrapped by the nonterminal lhs, and

whose children are the right hand side of prod.

The operation of replacing the left hand side (lhs) of a production with the right hand side (rhs) in a tree (tree) is known as "expanding" lhs to rhs in tree.

Class	`CFG`	A context-free grammar. A grammar consists of a start state and a set of productions. The set of terminals and nonterminals is implicitly specified by the productions.
Class	`DependencyGrammar`	A dependency grammar. A DependencyGrammar consists of a set of productions. Each production specifies a head/modifier relationship between a pair of words.
Class	`DependencyProduction`	A dependency grammar production. Each production maps a single head word to an unordered list of one or more modifier words.
Class	`FeatStructNonterminal`	A feature structure that's also a nonterminal. It acts as its own symbol, and automatically freezes itself when hashed.
Class	`FeatureGrammar`	A feature-based grammar. This is equivalent to a `CFG` whose nonterminals are all `FeatStructNonterminal`.
Class	`FeatureValueType`	A helper class for `FeatureGrammars`, designed to be different from ordinary strings. This is to stop the `FeatStruct` `FOO[]` from being compare equal to the terminal "FOO".
Class	`Nonterminal`	A non-terminal symbol for a context free grammar. `Nonterminal` is a wrapper class for node values; it is used by `Production` objects to distinguish node values from leaf values. The node value that is wrapped by a ...
Class	`PCFG`	A probabilistic context-free grammar. A PCFG consists of a start state and a set of productions with probabilities. The set of terminals and nonterminals is implicitly specified by the productions.
Class	`ProbabilisticDependencyGrammar`	No summary
Class	`ProbabilisticProduction`	A probabilistic context free grammar production. A PCFG `ProbabilisticProduction` is essentially just a `Production` that has an associated probability, which represents how likely it is that this production will be used...
Class	`Production`	A grammar production. Each production maps a single symbol on the "left-hand side" to a sequence of symbols on the "right-hand side". (In the case of context-free productions, the left-hand side must be a ...
Function	`cfg_demo`	A demonstration showing how `CFGs` can be created and used.
Function	`demo`	Undocumented
Function	`dg_demo`	A demonstration showing the creation and inspection of a `DependencyGrammar`.
Function	`fcfg_demo`	Undocumented
Function	`induce_pcfg`	Induce a PCFG grammar from a list of productions.
Function	`is_nonterminal`	No summary
Function	`is_terminal`	Return True if the item is a terminal, which currently is if it is hashable and not a `Nonterminal`.
Function	`nonterminals`	Given a string containing a list of symbol names, return a list of `Nonterminals` constructed from those symbols.
Function	`pcfg_demo`	A demonstration showing how a `PCFG` can be created and used.
Function	`read_grammar`	Return a pair consisting of a starting category and a list of `Productions`.
Function	`sdg_demo`	A demonstration of how to read a string representation of a CoNLL format dependency tree.
Function	`standard_nonterm_parser`	Undocumented
Variable	`toy_pcfg1`	Undocumented
Variable	`toy_pcfg2`	Undocumented
Function	`_read_cfg_production`	Return a list of context-free `Productions`.
Function	`_read_dependency_production`	Undocumented
Function	`_read_fcfg_production`	Return a list of feature-based `Productions`.
Function	`_read_pcfg_production`	Return a list of PCFG `ProbabilisticProductions`.
Function	`_read_production`	Parse a grammar rule, given as a string, and return a list of productions.
Constant	`_ARROW_RE`	Undocumented
Constant	`_DISJUNCTION_RE`	Undocumented
Constant	`_PROBABILITY_RE`	Undocumented
Constant	`_READ_DG_RE`	Undocumented
Constant	`_SPLIT_DG_RE`	Undocumented
Constant	`_STANDARD_NONTERM_RE`	Undocumented
Constant	`_TERMINAL_RE`	Undocumented

def cfg_demo(): (source) ¶

A demonstration showing how CFGs can be created and used.

def demo(): (source) ¶

Undocumented

def dg_demo(): (source) ¶

A demonstration showing the creation and inspection of a DependencyGrammar.

def fcfg_demo(): (source) ¶

Undocumented

def induce_pcfg(start, productions): (source) ¶

Induce a PCFG grammar from a list of productions.

The probability of a production A -> B C in a PCFG is:

count(A -> B C)

P(B, C | A) = --------------- where * is any right hand side

count(A -> *)

Parameters
start:Nonterminal	The start symbol
productions:list(Production)	The list of productions that defines the grammar

def is_nonterminal(item): (source) ¶

Returns
bool	True if the item is a `Nonterminal`.

def is_terminal(item): (source) ¶

Return True if the item is a terminal, which currently is if it is hashable and not a Nonterminal.

Returns
bool	Undocumented

def nonterminals(symbols): (source) ¶

Given a string containing a list of symbol names, return a list of Nonterminals constructed from those symbols.

Parameters
symbols:str	The symbol name string. This string can be delimited by either spaces or commas.
Returns
list(Nonterminal)	A list of `Nonterminals` constructed from the symbol names given in `symbols`. The `Nonterminals` are sorted in the same order as the symbols names.

def pcfg_demo(): (source) ¶

A demonstration showing how a PCFG can be created and used.

def read_grammar(input, nonterm_parser, probabilistic=False, encoding=None): (source) ¶

Return a pair consisting of a starting category and a list of Productions.

Parameters
input	a grammar, either in the form of a string or else as a list of strings.
nonterm_parser	a function for parsing nonterminals. It should take a `(string, position)` as argument and return a `(nonterminal, position)` as result.
probabilistic:bool	are the grammar rules probabilistic?
encoding:str	the encoding of the grammar, if it is a binary string

def sdg_demo(): (source) ¶

A demonstration of how to read a string representation of a CoNLL format dependency tree.

def standard_nonterm_parser(string, pos): (source) ¶

Undocumented

toy_pcfg1 = (source) ¶

Undocumented

toy_pcfg2 = (source) ¶

Undocumented

def _read_cfg_production(input): (source) ¶

Return a list of context-free Productions.

def _read_dependency_production(s): (source) ¶

Undocumented

def _read_fcfg_production(input, fstruct_reader): (source) ¶

Return a list of feature-based Productions.

def _read_pcfg_production(input): (source) ¶

Return a list of PCFG ProbabilisticProductions.

def _read_production(line, nonterm_parser, probabilistic=False): (source) ¶

Parse a grammar rule, given as a string, and return a list of productions.

_ARROW_RE = (source) ¶

Undocumented

Value

re.compile(r'\s* -> \s*',
           re.VERBOSE)

_DISJUNCTION_RE = (source) ¶

Undocumented

Value

re.compile(r'\| \s*',
           re.VERBOSE)

_PROBABILITY_RE = (source) ¶

Undocumented

Value

re.compile(r'( \[ [\d\.]+ \] ) \s*',
           re.VERBOSE)

_READ_DG_RE = (source) ¶

Undocumented

Value

re.compile('''^\\s*                # leading whitespace
                              (\'[^\']+\')\\s*        # single-quoted lhs
                              (?:[-=]+>)\\s*        # arrow
                              (?:(                 # rhs:
                                   "[^"]+"         # doubled-quoted terminal
                                 | \'[^\']+\'         # single-quoted terminal
                                 | \\|              # disjunction
...

_SPLIT_DG_RE = (source) ¶

Undocumented

Value

re.compile(r'(\'[^\']\'|[-=]+>|"[^"]+"|\'[^\']+\'|\|)')

_STANDARD_NONTERM_RE = (source) ¶

Undocumented

Value

re.compile(r'( [\w/][\w/\^<>-]* ) \s*',
           re.VERBOSE)

_TERMINAL_RE = (source) ¶

Undocumented

Value

re.compile(r'( "[^"]+" |\'[^\']+\' ) \s*',
           re.VERBOSE)