NLTK Taggers
This package contains classes and interfaces for part-of-speech tagging, or simply "tagging".
A "tag" is a case-sensitive string that specifies some property of a token, such as its part of speech. Tagged tokens are encoded as tuples (tag, token). For example, the following tagged token combines the word 'fly' with a noun part of speech tag ('NN'):
>>> tagged_tok = ('fly', 'NN')
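Tagged corpora on disk often use word/tag notation instead; the str2tuple helper in the util module (listed below) converts that notation into the tuple form shown above:

>>> from nltk.tag.util import str2tuple
>>> str2tuple('fly/NN')
('fly', 'NN')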
An off-the-shelf tagger is available for English. It uses the Penn Treebank tagset:
>>> from nltk import pos_tag, word_tokenize
>>> pos_tag(word_tokenize("John's big idea isn't all that bad."))
[('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is', 'VBZ'), ("n't", 'RB'), ('all', 'PDT'), ('that', 'DT'), ('bad', 'JJ'), ('.', '.')]
A Russian tagger is also available if you specify lang="rus". It uses the Russian National Corpus tagset:
>>> pos_tag(word_tokenize("Илья оторопел и дважды перечитал бумажку."), lang='rus') # doctest: +SKIP [('Илья', 'S'), ('оторопел', 'V'), ('и', 'CONJ'), ('дважды', 'ADV'), ('перечитал', 'V'), ('бумажку', 'S'), ('.', 'NONLEX')]
This package defines several taggers, which take a list of tokens, assign a tag to each one, and return the resulting list of tagged tokens. Most of the taggers are built automatically based on a training corpus. For example, the unigram tagger tags each word w by checking what the most frequent tag for w was in a training corpus:
>>> from nltk.corpus import brown
>>> from nltk.tag import UnigramTagger
>>> tagger = UnigramTagger(brown.tagged_sents(categories='news')[:500])
>>> sent = ['Mitchell', 'decried', 'the', 'high', 'rate', 'of', 'unemployment']
>>> for word, tag in tagger.tag(sent):
...     print(word, '->', tag)
Mitchell -> NP
decried -> None
the -> AT
high -> JJ
rate -> NN
of -> IN
unemployment -> None
Note that words that the tagger has not seen during training receive a tag of None.
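One way to avoid None tags is to give the unigram tagger a backoff tagger that is consulted for unseen words. A minimal sketch, reusing the Brown slice above and assuming every unknown word should default to a noun:

>>> from nltk.tag import DefaultTagger, UnigramTagger
>>> backoff = DefaultTagger('NN')  # tags every word it is asked about as 'NN'
>>> tagger = UnigramTagger(brown.tagged_sents(categories='news')[:500], backoff=backoff)
>>> tagger.tag(['decried', 'unemployment'])  # doctest: +SKIP
[('decried', 'NN'), ('unemployment', 'NN')]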
We evaluate a tagger on data that was not seen during training:
>>> tagger.evaluate(brown.tagged_sents(categories='news')[500:600])
0.7...
For more information, please consult chapter 5 of the NLTK Book.
Module | api | Interface for tagging each token in a sentence with supplementary information, such as its part of speech.
Module | brill | No module docstring; 5/5 functions, 3/3 classes documented
Module | brill_trainer | No module docstring; 1/1 class documented
Module | crf | A module for POS tagging using CRFSuite.
Module | hmm | Hidden Markov Models (HMMs), largely used to assign the correct label sequence to sequential data or to assess the probability of a given label and data sequence. These models are finite-state machines characterised by a number of states, transitions between these states, and output symbols emitted while in each state...
Module | hunpos | A module for interfacing with the HunPos open-source POS tagger.
Module | mapping | Interface for converting POS tags from various treebanks to the universal tagset of Petrov, Das, & McDonald.
Module | perceptron | No module docstring; 0/1 constants, 0/3 functions, 2/2 classes documented
Module | senna | Senna POS tagger, NER tagger, chunk tagger.
Module | sequential | Classes for tagging sentences sequentially, left to right. The abstract base class SequentialBackoffTagger serves as the base class for all the taggers in this module. Tagging of individual words is performed by the method ...
Module | stanford | A module for interfacing with the Stanford taggers.
Module | tnt | Implementation of 'TnT - A Statistical Part of Speech Tagger' by Thorsten Brants.
Module | util | No module docstring; 3/3 functions documented
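The recommended tagger behind pos_tag is the averaged perceptron from the perceptron module. It can also be instantiated directly, which avoids re-loading the model on repeated calls. A minimal sketch, assuming the pretrained English model has been fetched via nltk.download('averaged_perceptron_tagger'):

>>> from nltk.tag.perceptron import PerceptronTagger
>>> tagger = PerceptronTagger()  # loads the pretrained English model once
>>> tagger.tag(['The', 'cat', 'sat'])  # doctest: +SKIP
[('The', 'DT'), ('cat', 'NN'), ('sat', 'VBD')]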
From __init__.py:
Function | pos_tag | Use NLTK's currently recommended part-of-speech tagger to tag the given list of tokens.
Function | pos_tag_sents | Use NLTK's currently recommended part-of-speech tagger to tag the given list of sentences, each consisting of a list of tokens.
Constant | RUS_PICKLE | Undocumented
Function | _get_tagger | Undocumented
Function | _pos_tag | Undocumented
Use NLTK's currently recommended part-of-speech tagger to tag the given list of tokens.
>>> from nltk.tag import pos_tag
>>> from nltk.tokenize import word_tokenize
>>> pos_tag(word_tokenize("John's big idea isn't all that bad."))
[('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is', 'VBZ'), ("n't", 'RB'), ('all', 'PDT'), ('that', 'DT'), ('bad', 'JJ'), ('.', '.')]
>>> pos_tag(word_tokenize("John's big idea isn't all that bad."), tagset='universal')
[('John', 'NOUN'), ("'s", 'PRT'), ('big', 'ADJ'), ('idea', 'NOUN'), ('is', 'VERB'), ("n't", 'ADV'), ('all', 'DET'), ('that', 'DET'), ('bad', 'ADJ'), ('.', '.')]
NB. Use pos_tag_sents() for efficient tagging of more than one sentence.
Parameters
tokens : list(str) | Sequence of tokens to be tagged
tagset : str | The tagset to be used, e.g. universal, wsj, brown
lang : str | The ISO 639 code of the language, e.g. 'eng' for English, 'rus' for Russian
Returns
list(tuple(str, str)) | The tagged tokens
Use NLTK's currently recommended part-of-speech tagger to tag the given list of sentences, each consisting of a list of tokens.
Parameters
sentences : list(list(str)) | List of sentences to be tagged
tagset : str | The tagset to be used, e.g. universal, wsj, brown
lang : str | The ISO 639 code of the language, e.g. 'eng' for English, 'rus' for Russian
Returns
list(list(tuple(str, str))) | The list of tagged sentences
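The function takes pre-tokenized sentences and returns one tagged list per sentence, loading the tagger model once rather than on every call, which is why it is more efficient than calling pos_tag in a loop. A brief usage sketch (the example sentences and output are illustrative, not from the NLTK docs):

>>> from nltk.tag import pos_tag_sents
>>> from nltk.tokenize import word_tokenize
>>> sentences = ["John smiled.", "Mary laughed."]
>>> pos_tag_sents([word_tokenize(s) for s in sentences])  # doctest: +SKIP
[[('John', 'NNP'), ('smiled', 'VBD'), ('.', '.')], [('Mary', 'NNP'), ('laughed', 'VBD'), ('.', '.')]]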