class documentation

A trainer for tbl (transformation-based learning) taggers.

Method __init__ Construct a Brill tagger from a baseline tagger and a set of templates
Method train Trains the Brill tagger on the corpus train_sents, producing at most max_rules transformations, each of which reduces the net number of errors in the corpus by at least min_score, and each of which has accuracy not lower than min_acc.
Method _apply_rule Update test_sents by applying rule everywhere where its conditions are met.
Method _best_rule Find the next best rule. This is done by repeatedly taking a rule with the highest score and stepping through the corpus to see where it applies. When it makes an error (decreasing its score) it's bumped down, and we try a new rule with the highest score. When we find a rule which has the highest score and which has been tested against the entire corpus, we can conclude that it's the next best rule.
Method _clean Undocumented
Method _find_rules Use the templates to find rules that apply at index wordnum in the sentence sent and generate the tag new_tag.
Method _init_mappings Initialize the tag position mapping & the rule-related mappings. For each error in test_sents, find new rules that would correct it, and add them to the rule mappings.
Method _trace_apply Undocumented
Method _trace_header Undocumented
Method _trace_rule Undocumented
Method _trace_update_rules Undocumented
Method _update_rule_applies Update the rule data tables to reflect the fact that rule applies at the position (sentnum, wordnum).
Method _update_rule_not_applies Update the rule data tables to reflect the fact that rule does not apply at the position (sentnum, wordnum).
Method _update_rules Check if we should add or remove any rules from consideration, given the changes made by rule.
Method _update_tag_positions Update _tag_positions to reflect the changes to tags that are made by rule.
Instance Variable _deterministic Undocumented
Instance Variable _first_unknown_position Mapping from rules to the first position where we're unsure if the rule applies. This records the next position we need to check to see if the rule messed anything up.
Instance Variable _initial_tagger Undocumented
Instance Variable _positions_by_rule Mapping from rule to position to effect, specifying the effect that each rule has on the overall score, at each position. Position is (sentnum, wordnum); and effect is -1, 0, or 1. As with _rules_by_position, this mapping starts out only containing rules with positive effects; but when we examine a rule, we'll extend this mapping to include the positions where the rule is harmful or neutral.
Instance Variable _rule_scores Mapping from rules to upper bounds on their effects on the overall score. This is the inverse mapping to _rules_by_score. Invariant: _rule_scores[r] == sum(_positions_by_rule[r].values())
Instance Variable _ruleformat Undocumented
Instance Variable _rules_by_position Mapping from positions to the set of rules that are known to occur at that position. Position is (sentnum, wordnum). Initially, this will only contain positions where each rule applies in a helpful way; but when we examine a rule, we'll extend this list to also include positions where each rule applies in a harmful or neutral way.
Instance Variable _rules_by_score Mapping from scores to the set of rules whose effect on the overall score is upper bounded by that score. Invariant: _rules_by_score[s] contains r iff sum(_positions_by_rule[r].values()) == s.
Instance Variable _tag_positions Mapping from tags to lists of positions that use that tag.
Instance Variable _templates Undocumented
Instance Variable _trace Undocumented
def __init__(self, initial_tagger, templates, trace=0, deterministic=None, ruleformat='str'): (source)

Construct a Brill tagger from a baseline tagger and a set of templates

Parameters
initial_tagger: Tagger - the baseline tagger
templates: list of Template - templates to be used in training
trace: int - verbosity level
deterministic: bool - if True, adjudicate ties deterministically
ruleformat: str - format of reported Rules
Returns
BrillTagger - an untrained BrillTagger
def train(self, train_sents, max_rules=200, min_score=2, min_acc=None): (source)

Trains the Brill tagger on the corpus train_sents, producing at most max_rules transformations, each of which reduces the net number of errors in the corpus by at least min_score, and each of which has accuracy not lower than min_acc.

#imports
>>> from nltk.tbl.template import Template
>>> from nltk.tag.brill import Pos, Word
>>> from nltk.tag import untag, RegexpTagger, BrillTaggerTrainer

#some data
>>> from nltk.corpus import treebank
>>> training_data = treebank.tagged_sents()[:100]
>>> baseline_data = treebank.tagged_sents()[100:200]
>>> gold_data = treebank.tagged_sents()[200:300]
>>> testing_data = [untag(s) for s in gold_data]

>>> backoff = RegexpTagger([
... (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),  # cardinal numbers
... (r'(The|the|A|a|An|an)$', 'AT'),   # articles
... (r'.*able$', 'JJ'),                # adjectives
... (r'.*ness$', 'NN'),                # nouns formed from adjectives
... (r'.*ly$', 'RB'),                  # adverbs
... (r'.*s$', 'NNS'),                  # plural nouns
... (r'.*ing$', 'VBG'),                # gerunds
... (r'.*ed$', 'VBD'),                 # past tense verbs
... (r'.*', 'NN')                      # nouns (default)
... ])
>>> baseline = backoff #see NOTE1
>>> baseline.evaluate(gold_data) #doctest: +ELLIPSIS
0.2450142...

#templates
>>> Template._cleartemplates() #clear any templates created in earlier tests
>>> templates = [Template(Pos([-1])), Template(Pos([-1]), Word([0]))]

#construct a BrillTaggerTrainer
>>> tt = BrillTaggerTrainer(baseline, templates, trace=3)

>>> tagger1 = tt.train(training_data, max_rules=10)
TBL train (fast) (seqs: 100; tokens: 2417; tpls: 2; min score: 2; min acc: None)
Finding initial useful rules...
    Found 845 useful rules.
<BLANKLINE>
           B      |
   S   F   r   O  |        Score = Fixed - Broken
   c   i   o   t  |  R     Fixed = num tags changed incorrect -> correct
   o   x   k   h  |  u     Broken = num tags changed correct -> incorrect
   r   e   e   e  |  l     Other = num tags changed incorrect -> incorrect
   e   d   n   r  |  e
------------------+-------------------------------------------------------
 132 132   0   0  | AT->DT if Pos:NN@[-1]
  85  85   0   0  | NN->, if Pos:NN@[-1] & Word:,@[0]
  69  69   0   0  | NN->. if Pos:NN@[-1] & Word:.@[0]
  51  51   0   0  | NN->IN if Pos:NN@[-1] & Word:of@[0]
  47  63  16 161  | NN->IN if Pos:NNS@[-1]
  33  33   0   0  | NN->TO if Pos:NN@[-1] & Word:to@[0]
  26  26   0   0  | IN->. if Pos:NNS@[-1] & Word:.@[0]
  24  24   0   0  | IN->, if Pos:NNS@[-1] & Word:,@[0]
  22  27   5  24  | NN->-NONE- if Pos:VBD@[-1]
  17  17   0   0  | NN->CC if Pos:NN@[-1] & Word:and@[0]
>>> tagger1.rules()[1:3]
(Rule('001', 'NN', ',', [(Pos([-1]),'NN'), (Word([0]),',')]), Rule('001', 'NN', '.', [(Pos([-1]),'NN'), (Word([0]),'.')]))
>>> train_stats = tagger1.train_stats()
>>> [train_stats[stat] for stat in ['initialerrors', 'finalerrors', 'rulescores']]
[1775, 1269, [132, 85, 69, 51, 47, 33, 26, 24, 22, 17]]
>>> tagger1.print_template_statistics(printunused=False)
TEMPLATE STATISTICS (TRAIN)  2 templates, 10 rules)
TRAIN (   2417 tokens) initial  1775 0.2656 final:  1269 0.4750
#ID | Score (train) |  #Rules     | Template
--------------------------------------------
001 |   305   0.603 |   7   0.700 | Template(Pos([-1]),Word([0]))
000 |   201   0.397 |   3   0.300 | Template(Pos([-1]))
<BLANKLINE>
<BLANKLINE>
>>> tagger1.evaluate(gold_data) # doctest: +ELLIPSIS
0.43996...
>>> tagged, test_stats = tagger1.batch_tag_incremental(testing_data, gold_data)
>>> tagged[33][12:] == [('foreign', 'IN'), ('debt', 'NN'), ('of', 'IN'), ('$', 'NN'), ('64', 'CD'),
... ('billion', 'NN'), ('*U*', 'NN'), ('--', 'NN'), ('the', 'DT'), ('third-highest', 'NN'), ('in', 'NN'),
... ('the', 'DT'), ('developing', 'VBG'), ('world', 'NN'), ('.', '.')]
True
>>> [test_stats[stat] for stat in ['initialerrors', 'finalerrors', 'rulescores']]
[1855, 1376, [100, 85, 67, 58, 27, 36, 27, 16, 31, 32]]

# a high-accuracy tagger
>>> tagger2 = tt.train(training_data, max_rules=10, min_acc=0.99)
TBL train (fast) (seqs: 100; tokens: 2417; tpls: 2; min score: 2; min acc: 0.99)
Finding initial useful rules...
    Found 845 useful rules.
<BLANKLINE>
           B      |
   S   F   r   O  |        Score = Fixed - Broken
   c   i   o   t  |  R     Fixed = num tags changed incorrect -> correct
   o   x   k   h  |  u     Broken = num tags changed correct -> incorrect
   r   e   e   e  |  l     Other = num tags changed incorrect -> incorrect
   e   d   n   r  |  e
------------------+-------------------------------------------------------
 132 132   0   0  | AT->DT if Pos:NN@[-1]
  85  85   0   0  | NN->, if Pos:NN@[-1] & Word:,@[0]
  69  69   0   0  | NN->. if Pos:NN@[-1] & Word:.@[0]
  51  51   0   0  | NN->IN if Pos:NN@[-1] & Word:of@[0]
  36  36   0   0  | NN->TO if Pos:NN@[-1] & Word:to@[0]
  26  26   0   0  | NN->. if Pos:NNS@[-1] & Word:.@[0]
  24  24   0   0  | NN->, if Pos:NNS@[-1] & Word:,@[0]
  19  19   0   6  | NN->VB if Pos:TO@[-1]
  18  18   0   0  | CD->-NONE- if Pos:NN@[-1] & Word:0@[0]
  18  18   0   0  | NN->CC if Pos:NN@[-1] & Word:and@[0]
>>> tagger2.evaluate(gold_data)  # doctest: +ELLIPSIS
0.44159544...
>>> tagger2.rules()[2:4]
(Rule('001', 'NN', '.', [(Pos([-1]),'NN'), (Word([0]),'.')]), Rule('001', 'NN', 'IN', [(Pos([-1]),'NN'), (Word([0]),'of')]))

# NOTE1: (!!FIXME) A far better baseline uses nltk.tag.UnigramTagger,
# with a RegexpTagger only as backoff. For instance,
# >>> baseline = UnigramTagger(baseline_data, backoff=backoff)
# However, as of Nov 2013, nltk.tag.UnigramTagger does not yield consistent results
# between python versions. The simplistic backoff above is a workaround to make doctests
# get consistent input.

Parameters
train_sents: list(list(tuple)) - training data
max_rules: int - output at most max_rules rules
min_score: int - stop training when no rules better than min_score can be found
min_acc: float or None - discard any rule with lower accuracy than min_acc
Returns
BrillTagger - the learned tagger
def _apply_rule(self, rule, test_sents): (source)

Update test_sents by applying rule everywhere where its conditions are met.

def _best_rule(self, train_sents, test_sents, min_score, min_acc): (source)

Find the next best rule. This is done by repeatedly taking a rule with the highest score and stepping through the corpus to see where it applies. When it makes an error (decreasing its score) it's bumped down, and we try a new rule with the highest score. When we find a rule which has the highest score and which has been tested against the entire corpus, we can conclude that it's the next best rule.
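This search is a form of lazy best-first selection: every candidate starts with an optimistic upper bound on its score, and only the current front-runner is ever scored exactly. A simplified, library-free sketch (the rule objects and scoring function are hypothetical stand-ins, not the trainer's actual data structures):

```python
import heapq

def best_rule(upper_bounds, true_score):
    """Return (rule, score) for the rule with the highest true score.

    upper_bounds: dict mapping each candidate rule to an optimistic
        upper bound on its score.
    true_score: function computing a rule's exact score (the expensive
        corpus pass in the real trainer).
    """
    # max-heap of (negated score, rule); entries start as upper bounds
    heap = [(-bound, rule) for rule, bound in upper_bounds.items()]
    heapq.heapify(heap)
    refined = set()  # rules whose heap entry is already an exact score
    while heap:
        neg_score, rule = heapq.heappop(heap)
        if rule in refined:
            # an exact score beat every remaining upper bound: done
            return rule, -neg_score
        # refine the front-runner's bound to its exact score and re-insert
        refined.add(rule)
        heapq.heappush(heap, (-true_score(rule), rule))
    return None, 0
```

Because every bound only ever decreases when refined, a refined score that reaches the top of the heap is guaranteed to dominate all remaining candidates, so the expensive corpus pass is skipped for most rules.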

def _clean(self): (source)

Undocumented

def _find_rules(self, sent, wordnum, new_tag): (source)

Use the templates to find rules that apply at index wordnum in the sentence sent and generate the tag new_tag.

def _init_mappings(self, test_sents, train_sents): (source)

Initialize the tag position mapping & the rule-related mappings. For each error in test_sents, find new rules that would correct it, and add them to the rule mappings.
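The initialization described above amounts to one pass over the corpus: every position is indexed by its current tag, and wherever the current tag disagrees with the gold tag, candidate rules are generated and indexed. A schematic, nltk-free sketch (find_rules is a hypothetical stand-in for the template-driven _find_rules):

```python
def init_mappings(test_sents, train_sents, find_rules):
    """One pass over the corpus: index every tag's positions, and for
    each tagging error, the candidate rules that would correct it.
    Sentences are lists of (word, tag) pairs."""
    tag_positions = {}   # tag -> list of (sentnum, wordnum)
    rule_positions = {}  # rule -> set of positions where it fixes an error
    for sentnum, (test_sent, gold_sent) in enumerate(zip(test_sents, train_sents)):
        for wordnum, (word, tag) in enumerate(test_sent):
            tag_positions.setdefault(tag, []).append((sentnum, wordnum))
            gold_tag = gold_sent[wordnum][1]
            if tag != gold_tag:  # an error: propose rules producing the gold tag
                for rule in find_rules(test_sent, wordnum, gold_tag):
                    rule_positions.setdefault(rule, set()).add((sentnum, wordnum))
    return tag_positions, rule_positions
```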

def _trace_apply(self, num_updates): (source)

Undocumented

def _trace_header(self): (source)

Undocumented

def _trace_rule(self, rule): (source)

Undocumented

def _trace_update_rules(self, num_obsolete, num_new, num_unseen): (source)

Undocumented

def _update_rule_applies(self, rule, sentnum, wordnum, train_sents): (source)

Update the rule data tables to reflect the fact that rule applies at the position (sentnum, wordnum).

def _update_rule_not_applies(self, rule, sentnum, wordnum): (source)

Update the rule data tables to reflect the fact that rule does not apply at the position (sentnum, wordnum).

def _update_rules(self, rule, train_sents, test_sents): (source)

Check if we should add or remove any rules from consideration, given the changes made by rule.

def _update_tag_positions(self, rule): (source)

Update _tag_positions to reflect the changes to tags that are made by rule.
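Keeping each tag's position list sorted lets later passes scan positions in corpus order. An update after a rule retags one position can be sketched with the standard bisect module (the position tuples and tag names here are illustrative, not the trainer's actual state):

```python
import bisect

# tag -> sorted list of (sentnum, wordnum) positions currently bearing that tag
tag_positions = {
    'NN': [(0, 1), (0, 4), (1, 2)],
    'DT': [(0, 0)],
}

def retag_position(pos, old_tag, new_tag):
    """Move one position from old_tag's list to new_tag's, keeping both sorted."""
    old_list = tag_positions[old_tag]
    old_list.pop(bisect.bisect_left(old_list, pos))   # remove from old tag's list
    new_list = tag_positions.setdefault(new_tag, [])
    bisect.insort_left(new_list, pos)                 # insert into new tag's list

# applying a rule that retags position (0, 4) from NN to DT
retag_position((0, 4), 'NN', 'DT')
```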

_deterministic = (source)

Undocumented

_first_unknown_position = (source)

Mapping from rules to the first position where we're unsure if the rule applies. This records the next position we need to check to see if the rule messed anything up.

_initial_tagger = (source)

Undocumented

_positions_by_rule = (source)

Mapping from rule to position to effect, specifying the effect that each rule has on the overall score, at each position. Position is (sentnum, wordnum); and effect is -1, 0, or 1. As with _rules_by_position, this mapping starts out only containing rules with positive effects; but when we examine a rule, we'll extend this mapping to include the positions where the rule is harmful or neutral.

_rule_scores = (source)

Mapping from rules to upper bounds on their effects on the overall score. This is the inverse mapping to _rules_by_score. Invariant: _rule_scores[r] == sum(_positions_by_rule[r].values())

_ruleformat = (source)

Undocumented

_rules_by_position = (source)

Mapping from positions to the set of rules that are known to occur at that position. Position is (sentnum, wordnum). Initially, this will only contain positions where each rule applies in a helpful way; but when we examine a rule, we'll extend this list to also include positions where each rule applies in a harmful or neutral way.

_rules_by_score = (source)

Mapping from scores to the set of rules whose effect on the overall score is upper bounded by that score. Invariant: _rules_by_score[s] contains r iff sum(_positions_by_rule[r].values()) == s.
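The invariant tying _rule_scores and _rules_by_score together can be illustrated with plain dicts standing in for the trainer's internal tables (the rule string is just an example):

```python
from collections import defaultdict

rule_scores = {}                   # rule -> current score (upper bound)
rules_by_score = defaultdict(set)  # score -> set of rules at that score

def set_score(rule, new_score):
    """Move a rule to a new score bucket, keeping both tables in sync."""
    old = rule_scores.get(rule)
    if old is not None:
        rules_by_score[old].discard(rule)  # remove stale inverse entry
    rule_scores[rule] = new_score
    rules_by_score[new_score].add(rule)

set_score('AT->DT if Pos:NN@[-1]', 132)
set_score('AT->DT if Pos:NN@[-1]', 130)  # refine the bound downward
```

Every update must touch both tables at once; otherwise the inverse-mapping invariant breaks and the best-first search can pick a rule from the wrong score bucket.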

_tag_positions = (source)

Mapping from tags to lists of positions that use that tag.

_templates = (source)

Undocumented

_trace = (source)

Undocumented