
NLTK Taggers

This package contains classes and interfaces for part-of-speech tagging, or simply "tagging".

A "tag" is a case-sensitive string that specifies some property of a token, such as its part of speech. Tagged tokens are encoded as tuples (token, tag). For example, the following tagged token combines the word 'fly' with a noun part of speech tag ('NN'):

>>> tagged_tok = ('fly', 'NN')
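Tagged tokens are also commonly written in slash notation, e.g. 'fly/NN'. The conversion between the two forms is a simple split and join; the standalone sketch below mirrors what nltk.tag's str2tuple and tuple2str helpers do, and is for illustration only:

```python
# Illustrative round trip between slash notation and (token, tag) tuples.
# This is a minimal stand-in, not NLTK's own implementation.

def str2tuple(s, sep='/'):
    """Split 'fly/NN' into ('fly', 'NN'); a bare word gets tag None."""
    word, _, tag = s.rpartition(sep)
    return (word, tag.upper()) if word else (s, None)

def tuple2str(tagged_token, sep='/'):
    """Join ('fly', 'NN') back into 'fly/NN'."""
    word, tag = tagged_token
    return word if tag is None else word + sep + tag

print(str2tuple('fly/NN'))       # ('fly', 'NN')
print(tuple2str(('fly', 'NN')))  # fly/NN
```

Splitting at the *last* separator keeps words that themselves contain '/' intact.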

An off-the-shelf tagger is available for English. It uses the Penn Treebank tagset:

>>> from nltk import pos_tag, word_tokenize
>>> pos_tag(word_tokenize("John's big idea isn't all that bad."))
[('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is', 'VBZ'),
("n't", 'RB'), ('all', 'PDT'), ('that', 'DT'), ('bad', 'JJ'), ('.', '.')]

A Russian tagger is also available if you specify lang="rus". It uses the Russian National Corpus tagset:

>>> pos_tag(word_tokenize("Илья оторопел и дважды перечитал бумажку."), lang='rus')    # doctest: +SKIP
[('Илья', 'S'), ('оторопел', 'V'), ('и', 'CONJ'), ('дважды', 'ADV'), ('перечитал', 'V'),
('бумажку', 'S'), ('.', 'NONLEX')]

This package defines several taggers, which take a list of tokens, assign a tag to each one, and return the resulting list of tagged tokens. Most of the taggers are built automatically based on a training corpus. For example, the unigram tagger tags each word w by checking what the most frequent tag for w was in a training corpus:

>>> from nltk.corpus import brown
>>> from nltk.tag import UnigramTagger
>>> tagger = UnigramTagger(brown.tagged_sents(categories='news')[:500])
>>> sent = ['Mitchell', 'decried', 'the', 'high', 'rate', 'of', 'unemployment']
>>> for word, tag in tagger.tag(sent):
...     print(word, '->', tag)
Mitchell -> NP
decried -> None
the -> AT
high -> JJ
rate -> NN
of -> IN
unemployment -> None

Note that words that the tagger has not seen during training receive a tag of None.
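A common remedy is to chain a backoff tagger so unseen words receive a fallback tag instead of None (in NLTK this is the backoff parameter, e.g. UnigramTagger(train, backoff=DefaultTagger('NN'))). The pure-Python sketch below illustrates the idea only; it is not NLTK's implementation:

```python
from collections import Counter, defaultdict

# Illustrative stand-in for a unigram tagger with a default-tag backoff.

def train_unigram(tagged_sents):
    """Map each word to its most frequent tag in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag(words, model, backoff_tag='NN'):
    """Tag each word; unseen words fall back to backoff_tag, not None."""
    return [(w, model.get(w, backoff_tag)) for w in words]

model = train_unigram([[('the', 'AT'), ('rate', 'NN')],
                       [('the', 'AT'), ('high', 'JJ'), ('rate', 'NN')]])
print(tag(['the', 'high', 'unemployment'], model))
# [('the', 'AT'), ('high', 'JJ'), ('unemployment', 'NN')]
```

Defaulting to 'NN' is a crude but serviceable backoff, since nouns are the most common open-class tag.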

We evaluate a tagger on data that was not seen during training:

>>> tagger.evaluate(brown.tagged_sents(categories='news')[500:600])
0.7...
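The score reported here is token-level accuracy: the fraction of tokens whose predicted (word, tag) pair matches the gold standard. A minimal sketch of that computation (the data below is a made-up placeholder):

```python
def accuracy(gold_sents, tagged_sents):
    """Token-level accuracy: matching (word, tag) pairs / total tokens."""
    gold = [pair for sent in gold_sents for pair in sent]
    pred = [pair for sent in tagged_sents for pair in sent]
    correct = sum(1 for g, p in zip(gold, pred) if g == p)
    return correct / len(gold)

gold = [[('the', 'AT'), ('rate', 'NN')]]
pred = [[('the', 'AT'), ('rate', 'VB')]]
print(accuracy(gold, pred))  # 0.5
```

Note that a tag of None never matches a gold tag, so every unseen word counts against a unigram tagger with no backoff.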

For more information, please consult chapter 5 of the NLTK Book.

Module api Interface for tagging each token in a sentence with supplementary information, such as its part of speech.
Module brill No module docstring; 5/5 functions, 3/3 classes documented
Module brill_trainer No module docstring; 1/1 class documented
Module crf A module for POS tagging using CRFSuite
Module hmm Hidden Markov Models (HMMs) are largely used to assign the correct label sequence to sequential data or assess the probability of a given label and data sequence. These models are finite state machines characterised by a number of states, transitions between these states, and output symbols emitted while in each state...
Module hunpos A module for interfacing with the HunPos open-source POS-tagger.
Module mapping Interface for converting POS tags from various treebanks to the universal tagset of Petrov, Das, & McDonald.
Module perceptron No module docstring; 0/1 constant, 0/3 function, 2/2 classes documented
Module senna Senna POS tagger, NER Tagger, Chunk Tagger
Module sequential Classes for tagging sentences sequentially, left to right. The abstract base class SequentialBackoffTagger serves as the base class for all the taggers in this module. Tagging of individual words is performed by the method ...
Module stanford A module for interfacing with the Stanford taggers.
Module tnt Implementation of 'TnT - A Statistical Part of Speech Tagger' by Thorsten Brants
Module util No module docstring; 3/3 functions documented
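The mapping module's tagset conversion is, at its core, a lookup table from treebank-specific tags to the universal tags of Petrov, Das, & McDonald. The tiny excerpt below is for illustration only; the real table lives in nltk.tag.mapping (see map_tag) and covers the full Penn Treebank tagset:

```python
# A few Penn Treebank -> universal tag mappings, for illustration only.
PTB_TO_UNIVERSAL = {
    'NN': 'NOUN', 'NNP': 'NOUN', 'JJ': 'ADJ',
    'VBZ': 'VERB', 'DT': 'DET', 'RB': 'ADV', '.': '.',
}

def to_universal(tagged):
    """Map each tag into the universal tagset; unknown tags become 'X'."""
    return [(w, PTB_TO_UNIVERSAL.get(t, 'X')) for w, t in tagged]

print(to_universal([('idea', 'NN'), ('is', 'VBZ'), ('bad', 'JJ')]))
# [('idea', 'NOUN'), ('is', 'VERB'), ('bad', 'ADJ')]
```

This coarse-graining is what tagset='universal' applies on top of the tagger's native output.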

From __init__.py:

Function pos_tag Use NLTK's currently recommended part of speech tagger to tag the given list of tokens.
Function pos_tag_sents Use NLTK's currently recommended part of speech tagger to tag the given list of sentences, each consisting of a list of tokens.
Constant RUS_PICKLE Undocumented
Function _get_tagger Undocumented
Function _pos_tag Undocumented
def pos_tag(tokens, tagset=None, lang='eng'):

Use NLTK's currently recommended part of speech tagger to tag the given list of tokens.

>>> from nltk.tag import pos_tag
>>> from nltk.tokenize import word_tokenize
>>> pos_tag(word_tokenize("John's big idea isn't all that bad."))
[('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is', 'VBZ'),
("n't", 'RB'), ('all', 'PDT'), ('that', 'DT'), ('bad', 'JJ'), ('.', '.')]
>>> pos_tag(word_tokenize("John's big idea isn't all that bad."), tagset='universal')
[('John', 'NOUN'), ("'s", 'PRT'), ('big', 'ADJ'), ('idea', 'NOUN'), ('is', 'VERB'),
("n't", 'ADV'), ('all', 'DET'), ('that', 'DET'), ('bad', 'ADJ'), ('.', '.')]

NB. Use pos_tag_sents() for efficient tagging of more than one sentence.

Parameters
    tokens : list(str)
        Sequence of tokens to be tagged.
    tagset : str
        The tagset to be used, e.g. universal, wsj, brown.
    lang : str
        The ISO 639 code of the language, e.g. 'eng' for English, 'rus' for Russian.
Returns
    list(tuple(str, str))
        The tagged tokens.
def pos_tag_sents(sentences, tagset=None, lang='eng'):

Use NLTK's currently recommended part of speech tagger to tag the given list of sentences, each consisting of a list of tokens.

Parameters
    sentences : list(list(str))
        List of sentences to be tagged.
    tagset : str
        The tagset to be used, e.g. universal, wsj, brown.
    lang : str
        The ISO 639 code of the language, e.g. 'eng' for English, 'rus' for Russian.
Returns
    list(list(tuple(str, str)))
        The list of tagged sentences.
RUS_PICKLE: str = 'taggers/averaged_perceptron_tagger_ru/averaged_perceptron_tagger_ru.pickle'
def _get_tagger(lang=None):

Undocumented

def _pos_tag(tokens, tagset=None, tagger=None, lang=None):

Undocumented