module documentation

Demonstrations of transformation-based learning (Brill tagging): training, evaluation, rule inspection, plotting, and serialization.

Function corpus_size Return the size of a corpus (number of sequences and total number of tokens)
Function demo Run a demo with defaults. See source comments for details, or docstrings of any of the more specific demo_* functions.
Function demo_error_analysis Write a file with context for each erroneous word after tagging the test data
Function demo_generated_templates Template.expand and Feature.expand are class methods facilitating generating large amounts of templates. See their documentation for details.
Function demo_high_accuracy_rules Discard rules with low accuracy. This may hurt performance a bit, but will often produce rules which are more interesting for a human to read.
Function demo_learning_curve Plot a learning curve -- the contribution of the individual rules to tagging accuracy. Note: requires matplotlib
Function demo_multifeature_template Templates can have more than a single feature.
Function demo_multiposition_feature The feature(s) of a template take a list of positions relative to the current word at which the feature should be looked for, conceptually joined by logical OR. For instance, Pos([-1, 1]), given a value V, will hold whenever V is found one step to the left and/or one step to the right.
Function demo_repr_rule_format Exemplify repr(Rule) (see also str(Rule) and Rule.format("verbose"))
Function demo_serialize_tagger Serialize the learned tagger to a file in pickle format; reload it and validate the process.
Function demo_str_rule_format Exemplify str(Rule) (see also repr(Rule) and Rule.format("verbose"))
Function demo_template_statistics Show aggregate statistics per template. Little-used templates are candidates for deletion; much-used templates may be candidates for refinement.
Function demo_verbose_rule_format Exemplify Rule.format("verbose")
Function postag Brill Tagger Demonstration: train a Brill tagger on the given (or default) corpus and evaluate it; see the parameter list for the many options
Constant NN_CD_TAGGER Baseline tagger which tags numbers as CD and everything else as NN
Constant REGEXP_TAGGER Baseline tagger with a handful of regular-expression rules (numbers, articles, common suffixes; NN as the default)
Function _demo_plot Plot the learning curve(s) for the given per-rule statistics (helper for demo_learning_curve)
Function _demo_prepare_data Load and split the tagged corpus into baseline-training, rule-training, and test data (helper for postag)
def corpus_size(seqs): (source)

Return the size of a corpus (number of sequences and total number of tokens).

def demo(): (source)

Run a demo with defaults. See source comments for details, or docstrings of any of the more specific demo_* functions.

def demo_error_analysis(): (source)

Write a file with context for each erroneous word after tagging the test data.

def demo_generated_templates(): (source)

Template.expand and Feature.expand are class methods facilitating generating large amounts of templates. See their documentation for details.

Note: training with 500 templates can easily fill all available memory, even on relatively small corpora.
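
A hedged sketch of such template generation (the positions, window lengths, and combination sizes below are illustrative, not prescribed by this module):

    from nltk.tag.brill import Pos, Word
    from nltk.tbl import Template

    # Word features over 1- and 2-token windows around the current word
    wordtpls = Word.expand([-1, 0, 1], [1, 2], excludezero=False)
    # Pos features over 1- and 2-token windows, excluding the current position
    tagtpls = Pos.expand([-2, -1, 0, 1], [1, 2], excludezero=True)
    # all templates combining between 1 and 3 of the generated features
    templates = list(Template.expand([wordtpls, tagtpls], combinations=(1, 3)))
    print(len(templates), "templates generated")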

def demo_high_accuracy_rules(): (source)

Discard rules with low accuracy. This may hurt performance a bit, but will often produce rules which are more interesting for a human to read.

def demo_learning_curve(): (source)

Plot a learning curve -- the contribution of the individual rules to tagging accuracy. Note: requires matplotlib
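
A hedged sketch of requesting a learning curve through postag() (the filename and rule count are illustrative; per-rule statistics must be collected with incremental_stats):

    from nltk.tbl.demo import postag

    postag(
        incremental_stats=True,                     # needed for per-rule scores
        learning_curve_output="learningcurve.png",  # illustrative filename
        learning_curve_take=25,                     # plot the first 25 rules
    )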

def demo_multifeature_template(): (source)

Templates can have more than a single feature.
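
As a hedged illustration (the feature choices are mine, not from this module's source), a template combining a tag feature and a word feature could be built as follows; a rule generated from it fires only when both features match:

    from nltk.tag.brill import Pos, Word
    from nltk.tbl import Template

    # both the tag one step to the left AND the current word are inspected
    template = Template(Pos([-1]), Word([0]))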

def demo_multiposition_feature(): (source)

The feature(s) of a template take a list of positions relative to the current word at which the feature should be looked for, conceptually joined by logical OR. For instance, Pos([-1, 1]), given a value V, will hold whenever V is found one step to the left and/or one step to the right.

For contiguous ranges, a 2-arg form giving inclusive end points can also be used: Pos(-3, -1) is equivalent to Pos([-3, -2, -1]).
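
A hedged sketch contrasting the two forms (the templates are illustrative):

    from nltk.tag.brill import Pos
    from nltk.tbl import Template

    # disjunction of two positions: tag found at -1 and/or +1
    t1 = Template(Pos([-1, 1]))
    # 2-arg form, inclusive end points of a contiguous range
    t2 = Template(Pos(-3, -1))  # same as Template(Pos([-3, -2, -1]))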

def demo_repr_rule_format(): (source)

Exemplify repr(Rule) (see also str(Rule) and Rule.format("verbose"))

def demo_serialize_tagger(): (source)

Serialize the learned tagger to a file in pickle format; reload it and validate the process.
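
A minimal sketch of the pickle round-trip this demo performs, using a simple RegexpTagger as a stand-in for the learned tagger (the filename is illustrative):

    import pickle

    from nltk.tag import RegexpTagger

    tagger = RegexpTagger([(r"^-?[0-9]+(\.[0-9]+)?$", "CD"), (r".*", "NN")])
    with open("tagger.pcl", "wb") as f:
        pickle.dump(tagger, f)
    with open("tagger.pcl", "rb") as f:
        reloaded = pickle.load(f)
    # validate: both taggers give identical output on a sample sentence
    sample = "the quick 42 foxes".split()
    assert tagger.tag(sample) == reloaded.tag(sample)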

def demo_str_rule_format(): (source)

Exemplify str(Rule) (see also repr(Rule) and Rule.format("verbose"))

def demo_template_statistics(): (source)

Show aggregate statistics per template. Little-used templates are candidates for deletion; much-used templates may be candidates for refinement.

Deleting unused templates is mostly about saving time and/or space: training is basically O(T) in the number of templates T (also in terms of memory usage, which often will be the limiting factor).

def demo_verbose_rule_format(): (source)

Exemplify Rule.format("verbose")
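
To see the three formats side by side, here is a hedged, self-contained sketch that trains a tiny Brill tagger and prints each learned rule in all three formats (the corpus slice, templates, and training settings are illustrative):

    from nltk.corpus import treebank  # requires nltk.download("treebank")
    from nltk.tag import UnigramTagger
    from nltk.tag.brill import Pos, Word
    from nltk.tag.brill_trainer import BrillTaggerTrainer
    from nltk.tbl import Template

    train_sents = treebank.tagged_sents()[:200]
    baseline = UnigramTagger(train_sents)
    templates = [Template(Pos([-1])), Template(Word([0]))]
    tagger = BrillTaggerTrainer(baseline, templates).train(train_sents, max_rules=5)
    for rule in tagger.rules():
        print(repr(rule))              # constructor-like form
        print(str(rule))               # compact form
        print(rule.format("verbose"))  # long explanatory form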

def postag(templates=None, tagged_data=None, num_sents=1000, max_rules=300, min_score=3, min_acc=None, train=0.8, trace=3, randomize=False, ruleformat='str', incremental_stats=False, template_stats=False, error_output=None, serialize_output=None, learning_curve_output=None, learning_curve_take=300, baseline_backoff_tagger=None, separate_baseline_data=False, cache_baseline_tagger=None): (source)

Brill Tagger Demonstration. Parameters are documented in the table below.

Note on separate_baseline_data: if False, the training data is reused for both the baseline tagger and the rule learner. This is fast and fine for a demo, but is likely to generalize worse to unseen data. It also cannot sensibly be used for learning curves on training data (the baseline will be artificially high).

Parameters
templates : C{list of Template} -- the templates to generate rules from (a default set is used if None)
tagged_data : C{list of list of (str, str)} -- the tagged corpus to draw training and testing data from (a default corpus is loaded if None)
num_sents : C{int} -- how many sentences of training and testing data to use
max_rules : C{int} -- maximum number of rule instances to create
min_score : C{int} -- the minimum score for a rule in order for it to be considered
min_acc : C{float} -- the minimum accuracy for a rule in order for it to be considered
train : C{float} -- the fraction of the corpus to be used for training (1=all)
trace : C{int} -- the level of diagnostic tracing output to produce (0-4)
randomize : C{bool} -- whether the training data should be a random subset of the corpus
ruleformat : C{str} -- rule output format, one of "str", "repr", "verbose"
incremental_stats : C{bool} -- if true, will tag incrementally and collect stats for each rule (rather slow)
template_stats : C{bool} -- if true, will print per-template statistics collected in training and (optionally) testing
error_output : C{string} -- the file where errors will be saved
serialize_output : C{string} -- the file where the learned tbl tagger will be saved
learning_curve_output : C{string} -- filename of plot of learning curve(s) (train and also test, if available)
learning_curve_take : C{int} -- how many of the learned rules to include in the plot
baseline_backoff_tagger : tagger -- the backoff tagger used by the baseline tagger
separate_baseline_data : C{bool} -- use a fraction of the training data exclusively for training the baseline
cache_baseline_tagger : C{string} -- cache the baseline tagger to this file (only useful as a temporary workaround to get deterministic output from the baseline unigram tagger between python versions)
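
A hedged usage sketch, assuming this module is importable as nltk.tbl.demo; all argument values and filenames are illustrative:

    from nltk.tbl.demo import postag

    postag(
        num_sents=500,                  # use 500 sentences of the default corpus
        max_rules=50,                   # learn at most 50 rules
        ruleformat="verbose",           # print learned rules in the verbose format
        error_output="errors.txt",      # save an error analysis
        serialize_output="tagger.pcl",  # pickle the learned tagger
    )
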
NN_CD_TAGGER = (source)

Baseline tagger which tags numbers as CD and everything else as NN.

Value
RegexpTagger([('^-?[0-9]+(\.[0-9]+)?$', 'CD'), ('.*', 'NN')])
REGEXP_TAGGER = (source)

Baseline tagger with a handful of regular-expression rules: numbers, articles, and common suffixes, with NN as the default.

Value
RegexpTagger([('^-?[0-9]+(\.[0-9]+)?$', 'CD'),
              ('(The|the|A|a|An|an)$', 'AT'),
              ('.*able$', 'JJ'),
              ('.*ness$', 'NN'),
              ('.*ly$', 'RB'),
              ('.*s$', 'NNS'),
              ('.*ing$', 'VBG'),
...
def _demo_plot(learning_curve_output, teststats, trainstats=None, take=None): (source)

Plot the learning curve(s) for the given per-rule test (and optionally training) statistics, saving the plot to learning_curve_output (helper for demo_learning_curve).

def _demo_prepare_data(tagged_data, train, num_sents, randomize, separate_baseline_data): (source)

Load the tagged corpus if necessary and split it into baseline-training, rule-training, and test data (helper for postag).