nltk.tag.brill_trainer.BrillTaggerTrainer

class documentation

class BrillTaggerTrainer(object):

Constructor: BrillTaggerTrainer(initial_tagger, templates, trace, deterministic, ruleformat)

A trainer for TBL (transformation-based learning) taggers.

Method	`__init__`	Construct a Brill tagger from a baseline tagger and a set of templates
Method	`train`	Trains the Brill tagger on the corpus train_sents, producing at most max_rules transformations, each of which reduces the net number of errors in the corpus by at least min_score, and each of which has accuracy not lower than ...
Method	`_apply_rule`	Update test_sents by applying rule wherever its conditions are met.
Method	`_best_rule`	Find the next best rule. This is done by repeatedly taking a rule with the highest score and stepping through the corpus to see where it applies. When it makes an error (decreasing its score) it's bumped down, and we try a new rule with the highest score...
Method	`_clean`	Undocumented
Method	`_find_rules`	Use the templates to find rules that apply at index wordnum in the sentence sent and generate the tag new_tag.
Method	`_init_mappings`	Initialize the tag position mapping and the rule-related mappings. For each error in test_sents, find new rules that would correct it, and add them to the rule mappings.
Method	`_trace_apply`	Undocumented
Method	`_trace_header`	Undocumented
Method	`_trace_rule`	Undocumented
Method	`_trace_update_rules`	Undocumented
Method	`_update_rule_applies`	Update the rule data tables to reflect the fact that rule applies at the position (sentnum, wordnum).
Method	`_update_rule_not_applies`	Update the rule data tables to reflect the fact that rule does not apply at the position (sentnum, wordnum).
Method	`_update_rules`	Check if we should add or remove any rules from consideration, given the changes made by rule.
Method	`_update_tag_positions`	Update _tag_positions to reflect the changes to tags that are made by rule.
Instance Variable	`_deterministic`	Undocumented
Instance Variable	`_first_unknown_position`	Mapping from rules to the first position where we're unsure if the rule applies. This records the next position we need to check to see if the rule messed anything up.
Instance Variable	`_initial_tagger`	Undocumented
Instance Variable	`_positions_by_rule`	Mapping from rule to position to effect, specifying the effect that each rule has on the overall score, at each position. Position is (sentnum, wordnum); and effect is -1, 0, or 1. As with _rules_by_position, this mapping starts out only containing rules with positive effects; but when we examine a rule, we'll extend this mapping to include the positions where the rule is harmful or neutral.
Instance Variable	`_rule_scores`	Mapping from rules to upper bounds on their effects on the overall score. This is the inverse mapping to _rules_by_score. Invariant: _rule_scores[r] = sum(_positions_by_rule[r].values())
Instance Variable	`_ruleformat`	Undocumented
Instance Variable	`_rules_by_position`	Mapping from positions to the set of rules that are known to occur at that position. Position is (sentnum, wordnum). Initially, this will only contain positions where each rule applies in a helpful way; but when we examine a rule, we'll extend this mapping to also include positions where each rule applies in a harmful or neutral way.
Instance Variable	`_rules_by_score`	Mapping from scores to the set of rules whose effect on the overall score is upper bounded by that score. Invariant: _rules_by_score[s] contains r iff the sum of the effects in _positions_by_rule[r] is s.
Instance Variable	`_tag_positions`	Mapping from tags to lists of positions that use that tag.
Instance Variable	`_templates`	Undocumented
Instance Variable	`_trace`	Undocumented

def __init__(self, initial_tagger, templates, trace=0, deterministic=None, ruleformat='str'):

Construct a Brill tagger from a baseline tagger and a set of templates

Parameters
initial_tagger:Tagger	the baseline tagger
templates:list of Templates	templates to be used in training
trace:int	verbosity level
deterministic:bool	if True, adjudicate ties deterministically
ruleformat:str	format of reported Rules
Returns
BrillTagger	An untrained BrillTagger

def train(self, train_sents, max_rules=200, min_score=2, min_acc=None):

Trains the Brill tagger on the corpus train_sents, producing at most max_rules transformations, each of which reduces the net number of errors in the corpus by at least min_score, and each of which has accuracy not lower than min_acc.

#imports
>>> from nltk.tbl.template import Template
>>> from nltk.tag.brill import Pos, Word
>>> from nltk.tag import untag, RegexpTagger, BrillTaggerTrainer

#some data
>>> from nltk.corpus import treebank
>>> training_data = treebank.tagged_sents()[:100]
>>> baseline_data = treebank.tagged_sents()[100:200]
>>> gold_data = treebank.tagged_sents()[200:300]
>>> testing_data = [untag(s) for s in gold_data]

>>> backoff = RegexpTagger([
... (r'^-?[0-9]+(.[0-9]+)?$', 'CD'),   # cardinal numbers
... (r'(The|the|A|a|An|an)$', 'AT'),   # articles
... (r'.*able$', 'JJ'),                # adjectives
... (r'.*ness$', 'NN'),                # nouns formed from adjectives
... (r'.*ly$', 'RB'),                  # adverbs
... (r'.*s$', 'NNS'),                  # plural nouns
... (r'.*ing$', 'VBG'),                # gerunds
... (r'.*ed$', 'VBD'),                 # past tense verbs
... (r'.*', 'NN')                      # nouns (default)
... ])

>>> baseline = backoff #see NOTE1

>>> baseline.evaluate(gold_data) #doctest: +ELLIPSIS
0.2450142...

#templates
>>> Template._cleartemplates() #clear any templates created in earlier tests
>>> templates = [Template(Pos([-1])), Template(Pos([-1]), Word([0]))]

#construct a BrillTaggerTrainer
>>> tt = BrillTaggerTrainer(baseline, templates, trace=3)

>>> tagger1 = tt.train(training_data, max_rules=10)
TBL train (fast) (seqs: 100; tokens: 2417; tpls: 2; min score: 2; min acc: None)
Finding initial useful rules...
    Found 845 useful rules.
<BLANKLINE>
           B      |
   S   F   r   O  |        Score = Fixed - Broken
   c   i   o   t  |  R     Fixed = num tags changed incorrect -> correct
   o   x   k   h  |  u     Broken = num tags changed correct -> incorrect
   r   e   e   e  |  l     Other = num tags changed incorrect -> incorrect
   e   d   n   r  |  e
------------------+-------------------------------------------------------
 132 132   0   0  | AT->DT if Pos:NN@[-1]
  85  85   0   0  | NN->, if Pos:NN@[-1] & Word:,@[0]
  69  69   0   0  | NN->. if Pos:NN@[-1] & Word:.@[0]
  51  51   0   0  | NN->IN if Pos:NN@[-1] & Word:of@[0]
  47  63  16 161  | NN->IN if Pos:NNS@[-1]
  33  33   0   0  | NN->TO if Pos:NN@[-1] & Word:to@[0]
  26  26   0   0  | IN->. if Pos:NNS@[-1] & Word:.@[0]
  24  24   0   0  | IN->, if Pos:NNS@[-1] & Word:,@[0]
  22  27   5  24  | NN->-NONE- if Pos:VBD@[-1]
  17  17   0   0  | NN->CC if Pos:NN@[-1] & Word:and@[0]

>>> tagger1.rules()[1:3]
(Rule('001', 'NN', ',', [(Pos([-1]),'NN'), (Word([0]),',')]), Rule('001', 'NN', '.', [(Pos([-1]),'NN'), (Word([0]),'.')]))

>>> train_stats = tagger1.train_stats()
>>> [train_stats[stat] for stat in ['initialerrors', 'finalerrors', 'rulescores']]
[1775, 1269, [132, 85, 69, 51, 47, 33, 26, 24, 22, 17]]

>>> tagger1.print_template_statistics(printunused=False)
TEMPLATE STATISTICS (TRAIN)  2 templates, 10 rules)
TRAIN (   2417 tokens) initial  1775 0.2656 final:  1269 0.4750
#ID | Score (train) |  #Rules     | Template
--------------------------------------------
001 |   305   0.603 |   7   0.700 | Template(Pos([-1]),Word([0]))
000 |   201   0.397 |   3   0.300 | Template(Pos([-1]))
<BLANKLINE>
<BLANKLINE>

>>> tagger1.evaluate(gold_data) # doctest: +ELLIPSIS
0.43996...

>>> tagged, test_stats = tagger1.batch_tag_incremental(testing_data, gold_data)

>>> tagged[33][12:] == [('foreign', 'IN'), ('debt', 'NN'), ('of', 'IN'), ('$', 'NN'), ('64', 'CD'),
... ('billion', 'NN'), ('*U*', 'NN'), ('--', 'NN'), ('the', 'DT'), ('third-highest', 'NN'), ('in', 'NN'),
... ('the', 'DT'), ('developing', 'VBG'), ('world', 'NN'), ('.', '.')]
True

>>> [test_stats[stat] for stat in ['initialerrors', 'finalerrors', 'rulescores']]
[1855, 1376, [100, 85, 67, 58, 27, 36, 27, 16, 31, 32]]

# a high-accuracy tagger
>>> tagger2 = tt.train(training_data, max_rules=10, min_acc=0.99)
TBL train (fast) (seqs: 100; tokens: 2417; tpls: 2; min score: 2; min acc: 0.99)
Finding initial useful rules...
    Found 845 useful rules.
<BLANKLINE>
           B      |
   S   F   r   O  |        Score = Fixed - Broken
   c   i   o   t  |  R     Fixed = num tags changed incorrect -> correct
   o   x   k   h  |  u     Broken = num tags changed correct -> incorrect
   r   e   e   e  |  l     Other = num tags changed incorrect -> incorrect
   e   d   n   r  |  e
------------------+-------------------------------------------------------
 132 132   0   0  | AT->DT if Pos:NN@[-1]
  85  85   0   0  | NN->, if Pos:NN@[-1] & Word:,@[0]
  69  69   0   0  | NN->. if Pos:NN@[-1] & Word:.@[0]
  51  51   0   0  | NN->IN if Pos:NN@[-1] & Word:of@[0]
  36  36   0   0  | NN->TO if Pos:NN@[-1] & Word:to@[0]
  26  26   0   0  | NN->. if Pos:NNS@[-1] & Word:.@[0]
  24  24   0   0  | NN->, if Pos:NNS@[-1] & Word:,@[0]
  19  19   0   6  | NN->VB if Pos:TO@[-1]
  18  18   0   0  | CD->-NONE- if Pos:NN@[-1] & Word:0@[0]
  18  18   0   0  | NN->CC if Pos:NN@[-1] & Word:and@[0]

>>> tagger2.evaluate(gold_data)  # doctest: +ELLIPSIS
0.44159544...
>>> tagger2.rules()[2:4]
(Rule('001', 'NN', '.', [(Pos([-1]),'NN'), (Word([0]),'.')]), Rule('001', 'NN', 'IN', [(Pos([-1]),'NN'), (Word([0]),'of')]))

# NOTE1: (!!FIXME) A far better baseline uses nltk.tag.UnigramTagger,
# with a RegexpTagger only as backoff. For instance,
# >>> baseline = UnigramTagger(baseline_data, backoff=backoff)
# However, as of Nov 2013, nltk.tag.UnigramTagger does not yield consistent results
# between python versions. The simplistic backoff above is a workaround to make doctests
# get consistent input.

Parameters
train_sents:list(list(tuple))	training data
max_rules:int	output at most max_rules rules
min_score:int	stop training when no rules better than min_score can be found
min_acc:float or None	discard any rule with lower accuracy than min_acc
Returns
BrillTagger	the learned tagger
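The interplay of min_score and min_acc can be illustrated with a small sketch. It is not NLTK's code: the helper names are hypothetical, and it assumes accuracy means fixed / (fixed + broken), which is consistent with the traces above (the rule with Other = 6 survives min_acc=0.99, while the rule with Broken = 16 does not).

```python
def rule_score(fixed, broken):
    """Net error reduction, as in the trace header: Score = Fixed - Broken."""
    return fixed - broken


def rule_accuracy(fixed, broken):
    """Assumed definition: share of correct changes among fixed + broken."""
    changed = fixed + broken
    return fixed / changed if changed else 0.0


def is_acceptable(fixed, broken, min_score=2, min_acc=None):
    """Hypothetical helper mirroring the min_score / min_acc thresholds."""
    if rule_score(fixed, broken) < min_score:
        return False
    if min_acc is not None and rule_accuracy(fixed, broken) < min_acc:
        return False
    return True


# Fixed/Broken counts taken from the trace output above.
loose = is_acceptable(63, 16)                 # NN->IN if Pos:NNS@[-1], no accuracy threshold
strict = is_acceptable(63, 16, min_acc=0.99)  # same rule, accuracy 63/79
clean = is_acceptable(19, 0, min_acc=0.99)    # NN->VB if Pos:TO@[-1]
print(loose, strict, clean)
```

Under this assumption, tags changed from one incorrect value to another (the Other column) do not count against a rule's accuracy.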

def _apply_rule(self, rule, test_sents):

Update test_sents by applying rule wherever its conditions are met.
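A minimal sketch of what applying a rule to the working corpus might look like. It does not use NLTK's Rule class; the function apply_rule and the condition callback are hypothetical stand-ins. The sketch assumes that all matching positions in a sentence are determined before any tag is changed, so that one change cannot influence the condition check for another.

```python
def apply_rule(test_sents, original_tag, replacement_tag, condition):
    """Retag test_sents in place and return the changed (sentnum, wordnum) positions.

    condition(sent, wordnum) is a hypothetical callback standing in for the
    rule's instantiated template conditions.
    """
    changed = []
    for sentnum, sent in enumerate(test_sents):
        # First collect every position where the rule applies ...
        hits = [
            wordnum
            for wordnum, (_, tag) in enumerate(sent)
            if tag == original_tag and condition(sent, wordnum)
        ]
        # ... then rewrite the tags in a separate step.
        for wordnum in hits:
            word, _ = sent[wordnum]
            sent[wordnum] = (word, replacement_tag)
            changed.append((sentnum, wordnum))
    return changed


# Illustration of "NN->IN if Pos:NN@[-1] & Word:of@[0]" on a toy sentence.
sents = [[("cup", "NN"), ("of", "NN"), ("tea", "NN")]]
changed = apply_rule(
    sents,
    "NN",
    "IN",
    lambda sent, i: i > 0 and sent[i - 1][1] == "NN" and sent[i][0] == "of",
)
print(changed, sents)
```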

def _best_rule(self, train_sents, test_sents, min_score, min_acc):

Find the next best rule. This is done by repeatedly taking a rule with the highest score and stepping through the corpus to see where it applies. When it makes an error (decreasing its score) it's bumped down, and we try a new rule with the highest score. When we find a rule which has the highest score and which has been tested against the entire corpus, we can conclude that it's the next best rule.
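The lazy search described above can be sketched in simplified form. This is an illustration under stated assumptions, not the trainer's actual code: each rule starts with an optimistic score (its known fixes), and the errors it would make are revealed one at a time as the corpus is scanned. The names best_rule, optimistic_scores and broken_positions are hypothetical.

```python
def best_rule(optimistic_scores, broken_positions, min_score):
    """Return the best fully checked rule, or None if none reaches min_score.

    optimistic_scores: rule -> upper bound on its score (known fixes only)
    broken_positions:  rule -> positions where it breaks a correct tag,
                       discovered lazily while stepping through the corpus
    """
    scores = dict(optimistic_scores)
    pending = {rule: iter(broken_positions.get(rule, [])) for rule in scores}
    fully_checked = set()

    while scores:
        rule = max(scores, key=scores.get)   # current highest upper bound
        if scores[rule] < min_score:
            return None                      # nothing good enough remains
        if rule in fully_checked:
            return rule                      # bound is exact and still the highest
        if next(pending[rule], None) is None:
            fully_checked.add(rule)          # reached the end of the corpus
        else:
            scores[rule] -= 1                # made an error: bump it down
    return None


winner = best_rule(
    {"rule_a": 5, "rule_b": 4},
    {"rule_a": [(0, 1), (2, 3)], "rule_b": []},
    min_score=2,
)
print(winner)
```

In this toy run rule_a starts ahead, but two errors lower its score to 3, so the fully checked rule_b (score 4) is selected. Ties are broken here by dictionary order, which is an arbitrary choice of the sketch.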

def _clean(self):

Undocumented

def _find_rules(self, sent, wordnum, new_tag):

Use the templates to find rules that apply at index wordnum in the sentence sent and generate the tag new_tag.
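As a rough illustration of template instantiation, the sketch below represents a template as a list of (feature name, offset) pairs and a rule as a plain tuple. These representations and the function find_rules are assumptions made for the example; NLTK uses its own Template, Feature and Rule classes.

```python
# Hypothetical feature extractors, named after the Pos and Word features used above.
FEATURES = {
    "Pos": lambda sent, i: sent[i][1],
    "Word": lambda sent, i: sent[i][0],
}


def find_rules(templates, sent, wordnum, new_tag):
    """Instantiate each template at wordnum, yielding rules that produce new_tag."""
    original_tag = sent[wordnum][1]
    rules = []
    for template in templates:
        conditions = []
        for name, offset in template:
            i = wordnum + offset
            if not 0 <= i < len(sent):
                break  # template reaches outside the sentence: no rule
            conditions.append((name, offset, FEATURES[name](sent, i)))
        else:
            rules.append((original_tag, new_tag, tuple(conditions)))
    return rules


# Same shapes as Template(Pos([-1])) and Template(Pos([-1]), Word([0])).
templates = [[("Pos", -1)], [("Pos", -1), ("Word", 0)]]
sent = [("cup", "NN"), ("of", "NN"), ("tea", "NN")]
rules = find_rules(templates, sent, 1, "IN")
print(rules)
```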

def _init_mappings(self, test_sents, train_sents):

Initialize the tag position mapping and the rule-related mappings. For each error in test_sents, find new rules that would correct it, and add them to the rule mappings.

def _trace_apply(self, num_updates):

Undocumented

def _trace_header(self):

Undocumented

def _trace_rule(self, rule):

Undocumented

def _trace_update_rules(self, num_obsolete, num_new, num_unseen):

Undocumented

def _update_rule_applies(self, rule, sentnum, wordnum, train_sents):

Update the rule data tables to reflect the fact that rule applies at the position (sentnum, wordnum).

def _update_rule_not_applies(self, rule, sentnum, wordnum):

Update the rule data tables to reflect the fact that rule does not apply at the position (sentnum, wordnum).

def _update_rules(self, rule, train_sents, test_sents):

Check if we should add or remove any rules from consideration, given the changes made by rule.

def _update_tag_positions(self, rule):

Update _tag_positions to reflect the changes to tags that are made by rule.
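One plausible way to keep such a tag index current is sketched below. It assumes the per-tag position lists are kept sorted and uses the standard bisect module; the function name and arguments are hypothetical and only mirror the description above.

```python
import bisect


def update_tag_positions(tag_positions, original_tag, replacement_tag, changed):
    """Move each changed position from the old tag's list to the new tag's list."""
    old_positions = tag_positions[original_tag]
    new_positions = tag_positions.setdefault(replacement_tag, [])
    for pos in changed:
        # Delete the position from the old tag's sorted list, if present.
        index = bisect.bisect_left(old_positions, pos)
        if index < len(old_positions) and old_positions[index] == pos:
            del old_positions[index]
        # Insert it into the new tag's list, keeping that list sorted.
        bisect.insort_left(new_positions, pos)


tag_positions = {"NN": [(0, 0), (0, 1), (1, 2)], "IN": [(0, 3)]}
update_tag_positions(tag_positions, "NN", "IN", [(0, 1)])
print(tag_positions)
```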

_deterministic

Undocumented

_first_unknown_position

Mapping from rules to the first position where we're unsure if the rule applies. This records the next position we need to check to see if the rule messed anything up.

_initial_tagger

Undocumented

_positions_by_rule

Mapping from rule to position to effect, specifying the effect that each rule has on the overall score, at each position. Position is (sentnum, wordnum); and effect is -1, 0, or 1. As with _rules_by_position, this mapping starts out only containing rules with positive effects; but when we examine a rule, we'll extend this mapping to include the positions where the rule is harmful or neutral.
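The three effect values correspond to the Fixed, Broken and Other columns of the training trace. A small sketch of how an effect could be computed for one position, with a toy mapping in the shape described above (the function name and the string rule keys are illustrative assumptions):

```python
def position_effect(current_tag, new_tag, gold_tag):
    """Effect on the score of retagging one token from current_tag to new_tag."""
    if new_tag == current_tag:
        raise ValueError("a rule must change the tag")
    if new_tag == gold_tag:
        return 1    # incorrect -> correct (Fixed)
    if current_tag == gold_tag:
        return -1   # correct -> incorrect (Broken)
    return 0        # incorrect -> incorrect (Other)


# Toy mapping: rule -> {(sentnum, wordnum): effect}
positions_by_rule = {
    "NN->IN if Pos:NNS@[-1]": {
        (0, 3): position_effect("NN", "IN", "IN"),
        (0, 7): position_effect("NN", "IN", "NN"),
        (2, 1): position_effect("NN", "IN", "VBD"),
    }
}
print(positions_by_rule)
```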

_rule_scores

Mapping from rules to upper bounds on their effects on the overall score. This is the inverse mapping to _rules_by_score. Invariant: _rule_scores[r] = sum(_positions_by_rule[r].values())

_ruleformat

Undocumented

_rules_by_position

Mapping from positions to the set of rules that are known to occur at that position. Position is (sentnum, wordnum). Initially, this will only contain positions where each rule applies in a helpful way; but when we examine a rule, we'll extend this mapping to also include positions where each rule applies in a harmful or neutral way.

_rules_by_score

Mapping from scores to the set of rules whose effect on the overall score is upper bounded by that score. Invariant: _rules_by_score[s] contains r iff the sum of the effects in _positions_by_rule[r] is s.
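The relationship between the score mappings can be made concrete with plain dictionaries. The sketch below uses toy data and a hypothetical record_effect helper; it only illustrates how both invariants can be kept intact when a newly examined position is recorded, and it assumes the position has not been recorded for that rule before.

```python
from collections import defaultdict

# Toy data: rule -> {(sentnum, wordnum): effect}
positions_by_rule = {
    "rule_a": {(0, 1): 1, (0, 4): 1, (2, 0): -1},
    "rule_b": {(1, 1): 1},
    "rule_c": {(3, 2): 1, (3, 5): 1, (4, 0): 1},
}

# rule -> score, where the score is the sum of the rule's recorded effects
rule_scores = {rule: sum(effects.values()) for rule, effects in positions_by_rule.items()}

# score -> set of rules with that score (the inverse mapping)
rules_by_score = defaultdict(set)
for rule, score in rule_scores.items():
    rules_by_score[score].add(rule)


def record_effect(rule, position, effect):
    """Record a newly examined position and move the rule to its new score bucket."""
    old_score = rule_scores[rule]
    new_score = old_score + effect
    positions_by_rule[rule][position] = effect
    rules_by_score[old_score].discard(rule)
    rules_by_score[new_score].add(rule)
    rule_scores[rule] = new_score


def invariants_hold():
    """Check both invariants stated in the descriptions above."""
    return all(
        rule_scores[rule] == sum(positions_by_rule[rule].values())
        and rule in rules_by_score[rule_scores[rule]]
        for rule in positions_by_rule
    )


record_effect("rule_c", (5, 2), -1)  # rule_c turns out to break one tag
print(rule_scores, invariants_hold())
```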

_tag_positions

Mapping from tags to lists of positions that use that tag.
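A minimal sketch of building such an index from a tagged corpus, assuming positions are (sentnum, wordnum) pairs as elsewhere on this page. The function name build_tag_positions is hypothetical.

```python
from collections import defaultdict


def build_tag_positions(tagged_sents):
    """Map each tag to the list of (sentnum, wordnum) positions that carry it."""
    positions = defaultdict(list)
    for sentnum, sent in enumerate(tagged_sents):
        for wordnum, (_, tag) in enumerate(sent):
            positions[tag].append((sentnum, wordnum))
    return positions


tagged = [
    [("the", "DT"), ("cup", "NN")],
    [("of", "IN"), ("tea", "NN")],
]
index = build_tag_positions(tagged)
print(dict(index))
```

Because the corpus is scanned in order, each list comes out sorted by position.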

_templates

Undocumented

_trace

Undocumented