class documentation

Regular Expression Tagger

The RegexpTagger assigns tags to tokens by comparing their word strings to a series of regular expressions. The following tagger uses word suffixes to make guesses about the correct Brown Corpus part of speech tag:

>>> from nltk.corpus import brown
>>> from nltk.tag import RegexpTagger
>>> test_sent = brown.sents(categories='news')[0]
>>> regexp_tagger = RegexpTagger(
...     [(r'^-?[0-9]+(.[0-9]+)?$', 'CD'),   # cardinal numbers
...      (r'(The|the|A|a|An|an)$', 'AT'),   # articles
...      (r'.*able$', 'JJ'),                # adjectives
...      (r'.*ness$', 'NN'),                # nouns formed from adjectives
...      (r'.*ly$', 'RB'),                  # adverbs
...      (r'.*s$', 'NNS'),                  # plural nouns
...      (r'.*ing$', 'VBG'),                # gerunds
...      (r'.*ed$', 'VBD'),                 # past tense verbs
...      (r'.*', 'NN')                      # nouns (default)
... ])
>>> regexp_tagger
<Regexp Tagger: size=9>
>>> regexp_tagger.tag(test_sent)
[('The', 'AT'), ('Fulton', 'NN'), ('County', 'NN'), ('Grand', 'NN'), ('Jury', 'NN'),
('said', 'NN'), ('Friday', 'NN'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'NN'),
("Atlanta's", 'NNS'), ('recent', 'NN'), ('primary', 'NN'), ('election', 'NN'),
('produced', 'VBD'), ('``', 'NN'), ('no', 'NN'), ('evidence', 'NN'), ("''", 'NN'),
('that', 'NN'), ('any', 'NN'), ('irregularities', 'NNS'), ('took', 'NN'),
('place', 'NN'), ('.', 'NN')]
Parameters
regexpsA list of (regexp, tag) pairs, each of which indicates that a word matching regexp should be tagged with tag. The pairs will be evalutated in order. If none of the regexps match a word, then the optional backoff tagger is invoked, else it is assigned the tag None.
Class Method decode_json_obj Undocumented
Method __init__ No summary
Method __repr__ Undocumented
Method choose_tag Decide which tag should be used for the specified token, and return that tag. If this tagger is unable to determine a tag for the specified token, return None -- do not consult the backoff tagger. This method should be overridden by subclasses of SequentialBackoffTagger.
Method encode_json_obj Undocumented
Class Variable json_tag Undocumented
Instance Variable _regexps Undocumented

Inherited from SequentialBackoffTagger:

Method tag Determine the most appropriate tag sequence for the given token sequence, and return a corresponding list of tagged tokens. A tagged token is encoded as a tuple (token, tag).
Method tag_one Determine an appropriate tag for the specified token, and return that tag. If this tagger is unable to determine a tag for the specified token, then its backoff tagger is consulted.
Property backoff The backoff tagger for this tagger.
Instance Variable _taggers A list of all the taggers that should be tried to tag a token (i.e., self and its backoff taggers).

Inherited from TaggerI (via SequentialBackoffTagger):

Method evaluate Score the accuracy of the tagger against the gold standard. Strip the tags from the gold standard text, retag it using the tagger, then compute the accuracy score.
Method tag_sents Apply self.tag() to each element of sentences. I.e.:
Method _check_params Undocumented
@classmethod
def decode_json_obj(cls, obj): (source)

Undocumented

def __init__(self, regexps, backoff=None): (source)
def __repr__(self): (source)

Undocumented

def choose_tag(self, tokens, index, history): (source)

Decide which tag should be used for the specified token, and return that tag. If this tagger is unable to determine a tag for the specified token, return None -- do not consult the backoff tagger. This method should be overridden by subclasses of SequentialBackoffTagger.

Parameters
tokens:listThe list of words that are being tagged.
index:intThe index of the word whose tag should be returned.
history:list(str)A list of the tags for all words before index.
Returns
strUndocumented
def encode_json_obj(self): (source)

Undocumented

json_tag: str = (source)

Undocumented

_regexps = (source)

Undocumented