nltk.tag.crf.CRFTagger

class documentation

class CRFTagger(TaggerI): (source)

Constructor: CRFTagger(feature_func, verbose, training_opt)

A module for POS tagging using CRFSuite https://pypi.python.org/pypi/python-crfsuite

>>> from nltk.tag import CRFTagger
>>> ct = CRFTagger()

>>> train_data = [[('University','Noun'), ('is','Verb'), ('a','Det'), ('good','Adj'), ('place','Noun')],
... [('dog','Noun'),('eat','Verb'),('meat','Noun')]]

>>> ct.train(train_data,'model.crf.tagger')
>>> ct.tag_sents([['dog','is','good'], ['Cat','eat','meat']])
[[('dog', 'Noun'), ('is', 'Verb'), ('good', 'Adj')], [('Cat', 'Noun'), ('eat', 'Verb'), ('meat', 'Noun')]]

>>> gold_sentences = [[('dog','Noun'),('is','Verb'),('good','Adj')] , [('Cat','Noun'),('eat','Verb'), ('meat','Noun')]]
>>> ct.evaluate(gold_sentences)
1.0

Setting learned model file >>> ct = CRFTagger() >>> ct.set_model_file('model.crf.tagger') >>> ct.evaluate(gold_sentences) 1.0

Method	`__init__`	Initialize the CRFSuite tagger :param feature_func: The function that extracts features for each token of a sentence. This function should take 2 parameters: tokens and index which extract features at index position from tokens list...
Method	`set_model_file`	Undocumented
Method	`tag`	Train a new model using ``train'' function
Method	`tag_sents`	Train a new model using ``train'' function
Method	`train`	Train the CRF tagger using CRFSuite :params train_data : is the list of annotated sentences. :type train_data : list (list(tuple(str,str))) :params model_file : the model will be saved to this file.
Method	`_get_features`	Current Word
Instance Variable	`_feature_func`	Undocumented
Instance Variable	`_model_file`	Undocumented
Instance Variable	`_pattern`	Undocumented
Instance Variable	`_tagger`	Undocumented
Instance Variable	`_training_options`	Undocumented
Instance Variable	`_verbose`	Undocumented

Inherited from TaggerI:

Method	`evaluate`	Score the accuracy of the tagger against the gold standard. Strip the tags from the gold standard text, retag it using the tagger, then compute the accuracy score.
Method	`_check_params`	Undocumented

def __init__(self, feature_func=None, verbose=False, training_opt={}): (source) ¶

Initialize the CRFSuite tagger :param feature_func: The function that extracts features for each token of a sentence. This function should take 2 parameters: tokens and index which extract features at index position from tokens list. See the build in _get_features function for more detail. :param verbose: output the debugging messages during training. :type verbose: boolean :param training_opt: python-crfsuite training options :type training_opt : dictionary

Set of possible training options (using LBFGS training algorithm).

'feature.minfreq' : The minimum frequency of features. 'feature.possible_states' : Force to generate possible state features. 'feature.possible_transitions' : Force to generate possible transition features. 'c1' : Coefficient for L1 regularization. 'c2' : Coefficient for L2 regularization. 'max_iterations' : The maximum number of iterations for L-BFGS optimization. 'num_memories' : The number of limited memories for approximating the inverse hessian matrix. 'epsilon' : Epsilon for testing the convergence of the objective. 'period' : The duration of iterations to test the stopping criterion. 'delta' : The threshold for the stopping criterion; an L-BFGS iteration stops when the

improvement of the log likelihood over the last ${period} iterations is no greater than this threshold.

'linesearch' : The line search algorithm used in L-BFGS updates:

{ 'MoreThuente': More and Thuente's method,: 'Backtracking': Backtracking method with regular Wolfe condition, 'StrongBacktracking': Backtracking method with strong Wolfe condition

}

'max_linesearch' : The maximum number of trials for the line search algorithm.

def set_model_file(self, model_file): (source) ¶

Undocumented

def tag(self, tokens): (source) ¶

overrides nltk.tag.api.TaggerI.tag

Tag a sentence using Python CRFSuite Tagger. NB before using this function, user should specify the mode_file either by

Train a new model using ``train'' function
Use the pre-trained model which is set via ``set_model_file'' function

:params tokens : list of tokens needed to tag. :type tokens : list(str) :return : list of tagged tokens. :rtype : list (tuple(str,str))

def tag_sents(self, sents): (source) ¶

overrides nltk.tag.api.TaggerI.tag_sents

Tag a list of sentences. NB before using this function, user should specify the mode_file either by

Train a new model using ``train'' function
Use the pre-trained model which is set via ``set_model_file'' function

:params sentences : list of sentences needed to tag. :type sentences : list(list(str)) :return : list of tagged sentences. :rtype : list (list (tuple(str,str)))

def train(self, train_data, model_file): (source) ¶

Train the CRF tagger using CRFSuite :params train_data : is the list of annotated sentences. :type train_data : list (list(tuple(str,str))) :params model_file : the model will be saved to this file.

def _get_features(self, tokens, idx): (source) ¶

Extract basic features about this word including

Current Word
Is Capitalized ?
Has Punctuation ?
Has Number ?
Suffixes up to length 3

Note that : we might include feature over previous word, next word ect.

:return : a list which contains the features :rtype : list(str)

_feature_func = (source) ¶

Undocumented

_model_file = (source) ¶

Undocumented

_pattern = (source) ¶

Undocumented

_tagger = (source) ¶

Undocumented

_training_options = (source) ¶

Undocumented

_verbose = (source) ¶

Undocumented