class RegexpTagger(SequentialBackoffTagger): (source)
Constructor: RegexpTagger(regexps, backoff)
Regular Expression Tagger
The RegexpTagger assigns tags to tokens by comparing their word strings to a series of regular expressions. The following tagger uses word suffixes to make guesses about the correct Brown Corpus part of speech tag:
>>> from nltk.corpus import brown >>> from nltk.tag import RegexpTagger >>> test_sent = brown.sents(categories='news')[0] >>> regexp_tagger = RegexpTagger( ... [(r'^-?[0-9]+(.[0-9]+)?$', 'CD'), # cardinal numbers ... (r'(The|the|A|a|An|an)$', 'AT'), # articles ... (r'.*able$', 'JJ'), # adjectives ... (r'.*ness$', 'NN'), # nouns formed from adjectives ... (r'.*ly$', 'RB'), # adverbs ... (r'.*s$', 'NNS'), # plural nouns ... (r'.*ing$', 'VBG'), # gerunds ... (r'.*ed$', 'VBD'), # past tense verbs ... (r'.*', 'NN') # nouns (default) ... ]) >>> regexp_tagger <Regexp Tagger: size=9> >>> regexp_tagger.tag(test_sent) [('The', 'AT'), ('Fulton', 'NN'), ('County', 'NN'), ('Grand', 'NN'), ('Jury', 'NN'), ('said', 'NN'), ('Friday', 'NN'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'NN'), ("Atlanta's", 'NNS'), ('recent', 'NN'), ('primary', 'NN'), ('election', 'NN'), ('produced', 'VBD'), ('``', 'NN'), ('no', 'NN'), ('evidence', 'NN'), ("''", 'NN'), ('that', 'NN'), ('any', 'NN'), ('irregularities', 'NNS'), ('took', 'NN'), ('place', 'NN'), ('.', 'NN')]
Parameters | |
regexps | A list of (regexp, tag) pairs, each of which indicates that a word matching regexp should be tagged with tag. The pairs will be evalutated in order. If none of the regexps match a word, then the optional backoff tagger is invoked, else it is assigned the tag None. |
Class Method | decode |
Undocumented |
Method | __init__ |
No summary |
Method | __repr__ |
Undocumented |
Method | choose |
Decide which tag should be used for the specified token, and return that tag. If this tagger is unable to determine a tag for the specified token, return None -- do not consult the backoff tagger. This method should be overridden by subclasses of SequentialBackoffTagger. |
Method | encode |
Undocumented |
Class Variable | json |
Undocumented |
Instance Variable | _regexps |
Undocumented |
Inherited from SequentialBackoffTagger
:
Method | tag |
Determine the most appropriate tag sequence for the given token sequence, and return a corresponding list of tagged tokens. A tagged token is encoded as a tuple (token, tag). |
Method | tag |
Determine an appropriate tag for the specified token, and return that tag. If this tagger is unable to determine a tag for the specified token, then its backoff tagger is consulted. |
Property | backoff |
The backoff tagger for this tagger. |
Instance Variable | _taggers |
A list of all the taggers that should be tried to tag a token (i.e., self and its backoff taggers). |
Inherited from TaggerI
(via SequentialBackoffTagger
):
Method | evaluate |
Score the accuracy of the tagger against the gold standard. Strip the tags from the gold standard text, retag it using the tagger, then compute the accuracy score. |
Method | tag |
Apply self.tag() to each element of sentences. I.e.: |
Method | _check |
Undocumented |
Decide which tag should be used for the specified token, and return that tag. If this tagger is unable to determine a tag for the specified token, return None -- do not consult the backoff tagger. This method should be overridden by subclasses of SequentialBackoffTagger.
Parameters | |
tokens:list | The list of words that are being tagged. |
index:int | The index of the word whose tag should be returned. |
history:list(str) | A list of the tags for all words before index. |
Returns | |
str | Undocumented |