nltk.parse.util

module documentation

Utility functions for parsers.

Class	`TestGrammar`	Unit tests for CFG.
Function	`extract_test_sentences`	Parses a string with one test sentence per line. Lines can optionally begin with an expected result: a bool or an int.
Function	`load_parser`	Load a grammar from a file, and build a parser based on that grammar. The parser depends on the grammar format, and might also depend on properties of the grammar itself.
Function	`taggedsent_to_conll`	A function to convert a single POS-tagged sentence into CONLL format.
Function	`taggedsents_to_conll`	A function to convert a POS-tagged document stream (i.e. a list of sentences, each a list of tuples) and yield lines in CONLL format. It yields one line per word and two newlines at the end of each sentence.

def extract_test_sentences(string, comment_chars='#%;', encoding=None):

Parses a string with one test sentence per line. Lines can optionally begin with:

a bool, saying if the sentence is grammatical or not, or

an int, giving the number of parse trees it should have.

The result information is followed by a colon, and then the sentence. Empty lines and lines beginning with a comment char are ignored.

Parameters
string	the text to parse, with one test sentence per line
comment_chars	`str` of possible comment characters.
encoding	the encoding of the string, if it is binary
Returns
a list of (sentence, expected result) tuples, where a sentence is a list of str, and a result is None, a bool, or an int
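
The line format described above can be illustrated with a small standalone sketch. This is not NLTK's implementation: `parse_test_line` is a hypothetical helper written only from the rules stated here, and it is more lenient than the library may be (for example, it treats an unrecognized prefix before a colon as part of the sentence).

```python
from typing import List, Optional, Tuple, Union

Expected = Union[None, bool, int]


def parse_test_line(
    line: str, comment_chars: str = "#%;"
) -> Optional[Tuple[List[str], Expected]]:
    """Hypothetical helper: parse one line of the test-sentence format."""
    # Empty lines and lines starting with a comment character are ignored.
    if not line.strip() or line[0] in comment_chars:
        return None
    expected: Expected = None
    head, sep, rest = line.partition(":")
    if sep:
        label = head.strip()
        if label.lower() in ("true", "false"):
            # A bool says whether the sentence is grammatical.
            expected, line = label.lower() == "true", rest
        elif label.isdigit():
            # An int gives the expected number of parse trees.
            expected, line = int(label), rest
    tokens = line.split()
    return (tokens, expected) if tokens else None


suite = """\
# toy test suite
True: the dog barks
False: dog the barks
2: I saw the man with the telescope
the cat sleeps
"""

parsed = [p for p in map(parse_test_line, suite.splitlines()) if p is not None]
for tokens, expected in parsed:
    print(expected, tokens)
```

Each entry pairs a token list with `True`, `False`, an `int`, or `None` when the line carries no result prefix, matching the return shape documented above.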

def load_parser(grammar_url, trace=0, parser=None, chart_class=None, beam_size=0, **load_args):

Load a grammar from a file, and build a parser based on that grammar. The parser depends on the grammar format, and might also depend on properties of the grammar itself.

The following grammar formats are currently supported:

'cfg' (CFGs: CFG)
'pcfg' (probabilistic CFGs: PCFG)
'fcfg' (feature-based CFGs: FeatureGrammar)

Parameters
grammar_url:str	A URL specifying where the grammar is located. The default protocol is `"nltk:"`, which searches for the file in the NLTK data package.
trace:int	The level of tracing that should be used when parsing a text. `0` generates no tracing output; higher numbers produce more verbose tracing output.
parser	The class used for parsing; should be `ChartParser` or a subclass. If None, the class depends on the grammar format.
chart_class	The class used for storing the chart; should be `Chart` or a subclass. Only used for CFGs and feature CFGs. If None, the chart class depends on the grammar format.
beam_size:int	The maximum length for the parser's edge queue. Only used for probabilistic CFGs.
**load_args	Keyword parameters used when loading the grammar. See `data.load` for more information.
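
As a rough usage sketch, the following writes a toy CFG to a temporary file and loads it through a `file:` URL instead of the default `nltk:` protocol. The grammar text, the helper name `parse_with_toy_grammar`, and the file name `toy.cfg` are assumptions made for illustration; the sketch also assumes that the `.cfg` extension is what causes the file to be read as a plain CFG.

```python
import os
import tempfile

# A toy grammar, invented for this example.
TOY_GRAMMAR = """\
S -> NP VP
NP -> Det N
VP -> V
Det -> 'the'
N -> 'dog'
V -> 'barks'
"""


def parse_with_toy_grammar(sentence, trace=0):
    """Hypothetical helper: save TOY_GRAMMAR, load it, and parse one sentence."""
    # Imported here so that defining the helper does not itself require NLTK.
    from nltk.parse.util import load_parser

    with tempfile.TemporaryDirectory() as tmpdir:
        # Assumption: the ".cfg" extension selects the plain-CFG format.
        path = os.path.join(tmpdir, "toy.cfg")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(TOY_GRAMMAR)
        # "file:" overrides the default "nltk:" protocol.
        parser = load_parser("file:" + path, trace=trace)
        # parse() takes a list of tokens and yields parse trees.
        return list(parser.parse(sentence.split()))
```

Calling `parse_with_toy_grammar("the dog barks")` should return the parse trees for that sentence; passing a higher `trace` value prints the parser's steps while it runs.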

def taggedsent_to_conll(sentence):

A function to convert a single POS-tagged sentence into CONLL format.

>>> from nltk import word_tokenize, pos_tag
>>> from nltk.parse.util import taggedsent_to_conll
>>> text = "This is a foobar sentence."
>>> for line in taggedsent_to_conll(pos_tag(word_tokenize(text))):
...     print(line, end="")
1   This    _       DT      DT      _       0       a       _       _
2   is      _       VBZ     VBZ     _       0       a       _       _
3   a       _       DT      DT      _       0       a       _       _
4   foobar  _       JJ      JJ      _       0       a       _       _
5   sentence        _       NN      NN      _       0       a       _       _
6   .               _       .       .       _       0       a       _       _

Parameters
sentence:list(tuple(str, str))	A single input sentence to parse
Returns
iter(str)	a generator yielding a single sentence in CONLL format.
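
To make the row layout in the example explicit, here is a standalone sketch that builds the same ten columns by hand. `conll_rows` is a hypothetical re-implementation, not the library function. It assumes tab-separated fields and one trailing newline per row, and the column names in the comment follow the CoNLL-X convention.

```python
def conll_rows(tagged_sentence):
    """Hypothetical sketch of the ten-column row layout shown above."""
    for index, (word, tag) in enumerate(tagged_sentence, start=1):
        # Columns: ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL,
        # PHEAD, PDEPREL. Only ID, FORM, and the two tag columns carry data;
        # the rest are placeholders, as in the example output.
        fields = [str(index), word, "_", tag, tag, "_", "0", "a", "_", "_"]
        # Assumption: fields are tab-separated and each row ends in a newline.
        yield "\t".join(fields) + "\n"


tagged = [
    ("This", "DT"),
    ("is", "VBZ"),
    ("a", "DT"),
    ("foobar", "JJ"),
    ("sentence", "NN"),
    (".", "."),
]
for row in conll_rows(tagged):
    print(row, end="")
```

Because the input is already tagged, this sketch needs no tokenizer or tagger models.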

def taggedsents_to_conll(sentences):

A function to convert a POS-tagged document stream (i.e. a list of sentences, each a list of tuples) and yield lines in CONLL format. It yields one line per word and two newlines at the end of each sentence.

>>> from nltk import word_tokenize, sent_tokenize, pos_tag
>>> from nltk.parse.util import taggedsents_to_conll
>>> text = "This is a foobar sentence. Is that right?"
>>> sentences = [pos_tag(word_tokenize(sent)) for sent in sent_tokenize(text)]
>>> for line in taggedsents_to_conll(sentences):
...     if line:
...         print(line, end="")
1   This    _       DT      DT      _       0       a       _       _
2   is      _       VBZ     VBZ     _       0       a       _       _
3   a       _       DT      DT      _       0       a       _       _
4   foobar  _       JJ      JJ      _       0       a       _       _
5   sentence        _       NN      NN      _       0       a       _       _
6   .               _       .       .       _       0       a       _       _
<BLANKLINE>
<BLANKLINE>
1   Is      _       VBZ     VBZ     _       0       a       _       _
2   that    _       IN      IN      _       0       a       _       _
3   right   _       NN      NN      _       0       a       _       _
4   ?       _       .       .       _       0       a       _       _
<BLANKLINE>
<BLANKLINE>

Parameters
sentences:list(list(tuple(str, str)))	Input sentences to parse
Returns
iter(str)	a generator yielding sentences in CONLL format.
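
Since the generator interleaves word lines with blank separators, a consumer often needs to regroup the stream into sentences. The sketch below shows one way to do that. `group_conll_sentences` is a hypothetical helper, and the hand-written `stream` only imitates the documented output shape (one line per word, then a two-newline chunk), so it runs without a tagger.

```python
def group_conll_sentences(chunks):
    """Hypothetical helper: regroup a CONLL line stream into per-sentence rows."""
    current = []
    for chunk in chunks:
        if chunk.strip():
            # A word line: split it into its whitespace-separated columns.
            current.append(chunk.split())
        elif current:
            # A blank chunk marks the end of the current sentence.
            yield current
            current = []
    if current:
        # Flush a final sentence that was not followed by a separator.
        yield current


# Hand-written stand-in for the generator's output.
stream = [
    "1\tThis\t_\tDT\tDT\t_\t0\ta\t_\t_\n",
    "2\tworks\t_\tVBZ\tVBZ\t_\t0\ta\t_\t_\n",
    "\n\n",
    "1\tRight\t_\tUH\tUH\t_\t0\ta\t_\t_\n",
    "\n\n",
]
for rows in group_conll_sentences(stream):
    # Column 1 is the word form, column 3 the POS tag.
    print([(row[1], row[3]) for row in rows])
```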