
NLTK Taggers

This package contains classes and interfaces for part-of-speech tagging, or simply "tagging".

A "tag" is a case-sensitive string that specifies some property of a token, such as its part of speech. Tagged tokens are encoded as tuples (token, tag). For example, the following tagged token combines the word 'fly' with a noun part of speech tag ('NN'):

>>> tagged_tok = ('fly', 'NN')
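Tagged tokens are also commonly written in slash notation, e.g. 'fly/NN'. The conversion between the two forms is a simple split and join; the standalone sketch below mirrors what nltk.tag's str2tuple and tuple2str helpers do, and is for illustration only:

```python
# Illustrative round trip between slash notation and (token, tag) tuples.
# This is a minimal stand-in, not NLTK's own implementation.

def str2tuple(s, sep='/'):
    """Split 'fly/NN' into ('fly', 'NN'); a bare word gets tag None."""
    word, _, tag = s.rpartition(sep)
    return (word, tag.upper()) if word else (s, None)

def tuple2str(tagged_token, sep='/'):
    """Join ('fly', 'NN') back into 'fly/NN'."""
    word, tag = tagged_token
    return word if tag is None else word + sep + tag

print(str2tuple('fly/NN'))       # ('fly', 'NN')
print(tuple2str(('fly', 'NN')))  # fly/NN
```

Splitting at the *last* separator keeps words that themselves contain '/' intact.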

An off-the-shelf tagger is available for English. It uses the Penn Treebank tagset:

>>> from nltk import pos_tag, word_tokenize
>>> pos_tag(word_tokenize("John's big idea isn't all that bad."))
[('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is', 'VBZ'),
("n't", 'RB'), ('all', 'PDT'), ('that', 'DT'), ('bad', 'JJ'), ('.', '.')]

A Russian tagger is also available if you specify lang="rus". It uses the Russian National Corpus tagset:

>>> pos_tag(word_tokenize("Илья оторопел и дважды перечитал бумажку."), lang='rus')    # doctest: +SKIP
[('Илья', 'S'), ('оторопел', 'V'), ('и', 'CONJ'), ('дважды', 'ADV'), ('перечитал', 'V'),
('бумажку', 'S'), ('.', 'NONLEX')]

This package defines several taggers, which take a list of tokens, assign a tag to each one, and return the resulting list of tagged tokens. Most of the taggers are built automatically based on a training corpus. For example, the unigram tagger tags each word w by checking what the most frequent tag for w was in a training corpus:

>>> from nltk.corpus import brown
>>> from nltk.tag import UnigramTagger
>>> tagger = UnigramTagger(brown.tagged_sents(categories='news')[:500])
>>> sent = ['Mitchell', 'decried', 'the', 'high', 'rate', 'of', 'unemployment']
>>> for word, tag in tagger.tag(sent):
...     print(word, '->', tag)
Mitchell -> NP
decried -> None
the -> AT
high -> JJ
rate -> NN
of -> IN
unemployment -> None

Note that words that the tagger has not seen during training receive a tag of None.
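A common remedy is to chain a backoff tagger so unseen words receive a fallback tag instead of None (in NLTK this is the backoff parameter, e.g. UnigramTagger(train, backoff=DefaultTagger('NN'))). The pure-Python sketch below illustrates the idea only; it is not NLTK's implementation:

```python
from collections import Counter, defaultdict

# Illustrative stand-in for a unigram tagger with a default-tag backoff.

def train_unigram(tagged_sents):
    """Map each word to its most frequent tag in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag(words, model, backoff_tag='NN'):
    """Tag each word; unseen words fall back to backoff_tag, not None."""
    return [(w, model.get(w, backoff_tag)) for w in words]

model = train_unigram([[('the', 'AT'), ('rate', 'NN')],
                       [('the', 'AT'), ('high', 'JJ'), ('rate', 'NN')]])
print(tag(['the', 'high', 'unemployment'], model))
# [('the', 'AT'), ('high', 'JJ'), ('unemployment', 'NN')]
```

Defaulting to 'NN' is a crude but serviceable backoff, since nouns are the most common open-class tag.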

We evaluate a tagger on data that was not seen during training:

>>> tagger.evaluate(brown.tagged_sents(categories='news')[500:600])
0.7...
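The score reported here is token-level accuracy: the fraction of tokens whose predicted (word, tag) pair matches the gold standard. A minimal sketch of that computation (the data below is a made-up placeholder):

```python
def accuracy(gold_sents, tagged_sents):
    """Token-level accuracy: matching (word, tag) pairs / total tokens."""
    gold = [pair for sent in gold_sents for pair in sent]
    pred = [pair for sent in tagged_sents for pair in sent]
    correct = sum(1 for g, p in zip(gold, pred) if g == p)
    return correct / len(gold)

gold = [[('the', 'AT'), ('rate', 'NN')]]
pred = [[('the', 'AT'), ('rate', 'VB')]]
print(accuracy(gold, pred))  # 0.5
```

Note that a tag of None never matches a gold tag, so every unseen word counts against a unigram tagger with no backoff.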

For more information, please consult chapter 5 of the NLTK Book.

Module api Interface for tagging each token in a sentence with supplementary information, such as its part of speech.
Module brill No module docstring; 5/5 functions, 3/3 classes documented
Module brill_trainer No module docstring; 1/1 class documented
Module crf A module for POS tagging using CRFSuite
Module hmm Hidden Markov Models (HMMs) are largely used to assign the correct label sequence to sequential data or assess the probability of a given label and data sequence. These models are finite state machines characterised by a number of states, transitions between these states, and output symbols emitted while in each state...
Module hunpos A module for interfacing with the HunPos open-source POS-tagger.
Module mapping Interface for converting POS tags from various treebanks to the universal tagset of Petrov, Das, & McDonald.
Module perceptron No module docstring; 0/1 constant, 0/3 function, 2/2 classes documented
Module senna Senna POS tagger, NER Tagger, Chunk Tagger
Module sequential Classes for tagging sentences sequentially, left to right. The abstract base class SequentialBackoffTagger serves as the base class for all the taggers in this module. Tagging of individual words is performed by the method ...
Module stanford A module for interfacing with the Stanford taggers.
Module tnt Implementation of 'TnT - A Statistical Part of Speech Tagger' by Thorsten Brants
Module util No module docstring; 3/3 functions documented
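The mapping module's tagset conversion is, at its core, a lookup table from treebank-specific tags to the universal tags of Petrov, Das, & McDonald. The tiny excerpt below is for illustration only; the real table lives in nltk.tag.mapping (see map_tag) and covers the full Penn Treebank tagset:

```python
# A few Penn Treebank -> universal tag mappings, for illustration only.
PTB_TO_UNIVERSAL = {
    'NN': 'NOUN', 'NNP': 'NOUN', 'JJ': 'ADJ',
    'VBZ': 'VERB', 'DT': 'DET', 'RB': 'ADV', '.': '.',
}

def to_universal(tagged):
    """Map each tag into the universal tagset; unknown tags become 'X'."""
    return [(w, PTB_TO_UNIVERSAL.get(t, 'X')) for w, t in tagged]

print(to_universal([('idea', 'NN'), ('is', 'VBZ'), ('bad', 'JJ')]))
# [('idea', 'NOUN'), ('is', 'VERB'), ('bad', 'ADJ')]
```

This coarse-graining is what tagset='universal' applies on top of the tagger's native output.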

From __init__.py:

Function pos_tag Use NLTK's currently recommended part of speech tagger to tag the given list of tokens.
Function pos_tag_sents Use NLTK's currently recommended part of speech tagger to tag the given list of sentences, each consisting of a list of tokens.
Constant RUS_PICKLE Undocumented
Function _get_tagger Undocumented
Function _pos_tag Undocumented
def pos_tag(tokens, tagset=None, lang='eng'):

Use NLTK's currently recommended part of speech tagger to tag the given list of tokens.

>>> from nltk.tag import pos_tag
>>> from nltk.tokenize import word_tokenize
>>> pos_tag(word_tokenize("John's big idea isn't all that bad."))
[('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is', 'VBZ'),
("n't", 'RB'), ('all', 'PDT'), ('that', 'DT'), ('bad', 'JJ'), ('.', '.')]
>>> pos_tag(word_tokenize("John's big idea isn't all that bad."), tagset='universal')
[('John', 'NOUN'), ("'s", 'PRT'), ('big', 'ADJ'), ('idea', 'NOUN'), ('is', 'VERB'),
("n't", 'ADV'), ('all', 'DET'), ('that', 'DET'), ('bad', 'ADJ'), ('.', '.')]

NB. Use pos_tag_sents() for efficient tagging of more than one sentence.

Parameters
    tokens : list(str)
        Sequence of tokens to be tagged.
    tagset : str
        The tagset to be used, e.g. universal, wsj, brown.
    lang : str
        The ISO 639 code of the language, e.g. 'eng' for English, 'rus' for Russian.
Returns
    list(tuple(str, str))
        The tagged tokens.
def pos_tag_sents(sentences, tagset=None, lang='eng'):

Use NLTK's currently recommended part of speech tagger to tag the given list of sentences, each consisting of a list of tokens.

Parameters
    sentences : list(list(str))
        List of sentences to be tagged.
    tagset : str
        The tagset to be used, e.g. universal, wsj, brown.
    lang : str
        The ISO 639 code of the language, e.g. 'eng' for English, 'rus' for Russian.
Returns
    list(list(tuple(str, str)))
        The list of tagged sentences.
RUS_PICKLE: str = 'taggers/averaged_perceptron_tagger_ru/averaged_perceptron_tagger_ru.pickle'
def _get_tagger(lang=None):

Undocumented

def _pos_tag(tokens, tagset=None, tagger=None, lang=None):

Undocumented