module documentation

Demonstrations of transformation-based learning (Brill tagging): training, evaluation, rule inspection, plotting, and serialization.

Function corpus_size Return the size of a corpus (number of sequences and total number of tokens)
Function demo Run a demo with defaults. See source comments for details, or docstrings of any of the more specific demo_* functions.
Function demo_error_analysis Write a file with context for each erroneous word after tagging the test data
Function demo_generated_templates Template.expand and Feature.expand are class methods facilitating generating large amounts of templates. See their documentation for details.
Function demo_high_accuracy_rules Discard rules with low accuracy. This may hurt performance a bit, but will often produce rules which are more interesting for a human to read.
Function demo_learning_curve Plot a learning curve -- the contribution of the individual rules to tagging accuracy. Note: requires matplotlib
Function demo_multifeature_template Templates can have more than a single feature.
Function demo_multiposition_feature The feature(s) of a template take a list of positions relative to the current word at which the feature should be looked for, conceptually joined by logical OR. For instance, Pos([-1, 1]), given a value V, will hold whenever V is found one step to the left and/or one step to the right.
Function demo_repr_rule_format Exemplify repr(Rule) (see also str(Rule) and Rule.format("verbose"))
Function demo_serialize_tagger Serialize the learned tagger to a file in pickle format; reload it and validate the process.
Function demo_str_rule_format Exemplify str(Rule) (see also repr(Rule) and Rule.format("verbose"))
Function demo_template_statistics Show aggregate statistics per template. Little-used templates are candidates for deletion; much-used templates may be candidates for refinement.
Function demo_verbose_rule_format Exemplify Rule.format("verbose")
Function postag Brill Tagger Demonstration: train a Brill tagger on the given (or default) corpus and evaluate it; see the parameter list for the many options
Constant NN_CD_TAGGER Baseline tagger which tags numbers as CD and everything else as NN
Constant REGEXP_TAGGER Baseline tagger with a handful of regular-expression rules (numbers, articles, common suffixes; NN as the default)
Function _demo_plot Plot the learning curve(s) for the given per-rule statistics (helper for demo_learning_curve)
Function _demo_prepare_data Load and split the tagged corpus into baseline-training, rule-training, and test data (helper for postag)
def corpus_size(seqs): (source)

Return the size of a corpus (number of sequences and total number of tokens).

def demo(): (source)

Run a demo with defaults. See source comments for details, or docstrings of any of the more specific demo_* functions.

def demo_error_analysis(): (source)

Write a file with context for each erroneous word after tagging the test data.

def demo_generated_templates(): (source)

Template.expand and Feature.expand are class methods facilitating generating large amounts of templates. See their documentation for details.

Note: training with 500 templates can easily fill all available memory, even on relatively small corpora.
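
A hedged sketch of such template generation (the positions, window lengths, and combination sizes below are illustrative, not prescribed by this module):

    from nltk.tag.brill import Pos, Word
    from nltk.tbl import Template

    # Word features over 1- and 2-token windows around the current word
    wordtpls = Word.expand([-1, 0, 1], [1, 2], excludezero=False)
    # Pos features over 1- and 2-token windows, excluding the current position
    tagtpls = Pos.expand([-2, -1, 0, 1], [1, 2], excludezero=True)
    # all templates combining between 1 and 3 of the generated features
    templates = list(Template.expand([wordtpls, tagtpls], combinations=(1, 3)))
    print(len(templates), "templates generated")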

def demo_high_accuracy_rules(): (source)

Discard rules with low accuracy. This may hurt performance a bit, but will often produce rules which are more interesting for a human to read.

def demo_learning_curve(): (source)

Plot a learning curve -- the contribution of the individual rules to tagging accuracy. Note: requires matplotlib
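
A hedged sketch of requesting a learning curve through postag() (the filename and rule count are illustrative; per-rule statistics must be collected with incremental_stats):

    from nltk.tbl.demo import postag

    postag(
        incremental_stats=True,                     # needed for per-rule scores
        learning_curve_output="learningcurve.png",  # illustrative filename
        learning_curve_take=25,                     # plot the first 25 rules
    )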

def demo_multifeature_template(): (source)

Templates can have more than a single feature.
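
As a hedged illustration (the feature choices are mine, not from this module's source), a template combining a tag feature and a word feature could be built as follows; a rule generated from it fires only when both features match:

    from nltk.tag.brill import Pos, Word
    from nltk.tbl import Template

    # both the tag one step to the left AND the current word are inspected
    template = Template(Pos([-1]), Word([0]))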

def demo_multiposition_feature(): (source)

The feature(s) of a template take a list of positions relative to the current word at which the feature should be looked for, conceptually joined by logical OR. For instance, Pos([-1, 1]), given a value V, will hold whenever V is found one step to the left and/or one step to the right.

For contiguous ranges, a 2-arg form giving inclusive end points can also be used: Pos(-3, -1) is equivalent to Pos([-3, -2, -1]).
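
A hedged sketch contrasting the two forms (the templates are illustrative):

    from nltk.tag.brill import Pos
    from nltk.tbl import Template

    # disjunction of two positions: tag found at -1 and/or +1
    t1 = Template(Pos([-1, 1]))
    # 2-arg form, inclusive end points of a contiguous range
    t2 = Template(Pos(-3, -1))  # same as Template(Pos([-3, -2, -1]))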

def demo_repr_rule_format(): (source)

Exemplify repr(Rule) (see also str(Rule) and Rule.format("verbose"))

def demo_serialize_tagger(): (source)

Serialize the learned tagger to a file in pickle format; reload it and validate the process.
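
A minimal sketch of the pickle round-trip this demo performs, using a simple RegexpTagger as a stand-in for the learned tagger (the filename is illustrative):

    import pickle

    from nltk.tag import RegexpTagger

    tagger = RegexpTagger([(r"^-?[0-9]+(\.[0-9]+)?$", "CD"), (r".*", "NN")])
    with open("tagger.pcl", "wb") as f:
        pickle.dump(tagger, f)
    with open("tagger.pcl", "rb") as f:
        reloaded = pickle.load(f)
    # validate: both taggers give identical output on a sample sentence
    sample = "the quick 42 foxes".split()
    assert tagger.tag(sample) == reloaded.tag(sample)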

def demo_str_rule_format(): (source)

Exemplify str(Rule) (see also repr(Rule) and Rule.format("verbose"))

def demo_template_statistics(): (source)

Show aggregate statistics per template. Little-used templates are candidates for deletion; much-used templates may be candidates for refinement.

Deleting unused templates is mostly about saving time and/or space: training is basically O(T) in the number of templates T (also in terms of memory usage, which often will be the limiting factor).

def demo_verbose_rule_format(): (source)

Exemplify Rule.format("verbose")
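
To see the three formats side by side, here is a hedged, self-contained sketch that trains a tiny Brill tagger and prints each learned rule in all three formats (the corpus slice, templates, and training settings are illustrative):

    from nltk.corpus import treebank  # requires nltk.download("treebank")
    from nltk.tag import UnigramTagger
    from nltk.tag.brill import Pos, Word
    from nltk.tag.brill_trainer import BrillTaggerTrainer
    from nltk.tbl import Template

    train_sents = treebank.tagged_sents()[:200]
    baseline = UnigramTagger(train_sents)
    templates = [Template(Pos([-1])), Template(Word([0]))]
    tagger = BrillTaggerTrainer(baseline, templates).train(train_sents, max_rules=5)
    for rule in tagger.rules():
        print(repr(rule))              # constructor-like form
        print(str(rule))               # compact form
        print(rule.format("verbose"))  # long explanatory form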

def postag(templates=None, tagged_data=None, num_sents=1000, max_rules=300, min_score=3, min_acc=None, train=0.8, trace=3, randomize=False, ruleformat='str', incremental_stats=False, template_stats=False, error_output=None, serialize_output=None, learning_curve_output=None, learning_curve_take=300, baseline_backoff_tagger=None, separate_baseline_data=False, cache_baseline_tagger=None): (source)

Brill Tagger Demonstration. Parameters are documented in the table below.

Note on separate_baseline_data: if False, the training data is reused for both the baseline tagger and the rule learner. This is fast and fine for a demo, but is likely to generalize worse to unseen data. It also cannot sensibly be used for learning curves on training data (the baseline will be artificially high).

Parameters
templates : C{list of Template} -- the templates to generate rules from (a default set is used if None)
tagged_data : C{list of list of (str, str)} -- the tagged corpus to draw training and testing data from (a default corpus is loaded if None)
num_sents : C{int} -- how many sentences of training and testing data to use
max_rules : C{int} -- maximum number of rule instances to create
min_score : C{int} -- the minimum score for a rule in order for it to be considered
min_acc : C{float} -- the minimum accuracy for a rule in order for it to be considered
train : C{float} -- the fraction of the corpus to be used for training (1=all)
trace : C{int} -- the level of diagnostic tracing output to produce (0-4)
randomize : C{bool} -- whether the training data should be a random subset of the corpus
ruleformat : C{str} -- rule output format, one of "str", "repr", "verbose"
incremental_stats : C{bool} -- if true, will tag incrementally and collect stats for each rule (rather slow)
template_stats : C{bool} -- if true, will print per-template statistics collected in training and (optionally) testing
error_output : C{string} -- the file where errors will be saved
serialize_output : C{string} -- the file where the learned tbl tagger will be saved
learning_curve_output : C{string} -- filename of plot of learning curve(s) (train and also test, if available)
learning_curve_take : C{int} -- how many of the learned rules to include in the plot
baseline_backoff_tagger : tagger -- the backoff tagger used by the baseline tagger
separate_baseline_data : C{bool} -- use a fraction of the training data exclusively for training the baseline
cache_baseline_tagger : C{string} -- cache the baseline tagger to this file (only useful as a temporary workaround to get deterministic output from the baseline unigram tagger between python versions)
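
A hedged usage sketch, assuming this module is importable as nltk.tbl.demo; all argument values and filenames are illustrative:

    from nltk.tbl.demo import postag

    postag(
        num_sents=500,                  # use 500 sentences of the default corpus
        max_rules=50,                   # learn at most 50 rules
        ruleformat="verbose",           # print learned rules in the verbose format
        error_output="errors.txt",      # save an error analysis
        serialize_output="tagger.pcl",  # pickle the learned tagger
    )
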
NN_CD_TAGGER = (source)

Baseline tagger which tags numbers as CD and everything else as NN.

Value
RegexpTagger([('^-?[0-9]+(\.[0-9]+)?$', 'CD'), ('.*', 'NN')])
REGEXP_TAGGER = (source)

Baseline tagger with a handful of regular-expression rules: numbers, articles, and common suffixes, with NN as the default.

Value
RegexpTagger([('^-?[0-9]+(\.[0-9]+)?$', 'CD'),
              ('(The|the|A|a|An|an)$', 'AT'),
              ('.*able$', 'JJ'),
              ('.*ness$', 'NN'),
              ('.*ly$', 'RB'),
              ('.*s$', 'NNS'),
              ('.*ing$', 'VBG'),
...
def _demo_plot(learning_curve_output, teststats, trainstats=None, take=None): (source)

Plot the learning curve(s) for the given per-rule test (and optionally training) statistics, saving the plot to learning_curve_output (helper for demo_learning_curve).

def _demo_prepare_data(tagged_data, train, num_sents, randomize, separate_baseline_data): (source)

Load the tagged corpus if necessary and split it into baseline-training, rule-training, and test data (helper for postag).