nltk.tokenize.punkt.PunktBaseClass

class documentation

class PunktBaseClass(object):

Known subclasses: nltk.tokenize.punkt.PunktSentenceTokenizer, nltk.tokenize.punkt.PunktTrainer

Constructor: PunktBaseClass(lang_vars, token_cls, params)

Includes common components of PunktTrainer and PunktSentenceTokenizer.

Method	`__init__`	Undocumented
Method	`_annotate_first_pass`	Perform the first pass of annotation, which makes decisions based purely on the word type of each word.
Method	`_first_pass_annotation`	Performs type-based annotation on a single token.
Method	`_tokenize_words`	Divide the given text into tokens using the punkt word segmentation regular expression, and generate the resulting tokens as three-tuples: the token plus two boolean values indicating whether it occurs at the start of a paragraph or at the start of a new line, respectively.
Instance Variable	`_lang_vars`	Undocumented
Instance Variable	`_params`	The collection of parameters that determines the behavior of the punkt tokenizer.
Instance Variable	`_Token`	The token class used to build tokens (the `token_cls` constructor argument; `PunktToken` by default).

def __init__(self, lang_vars=None, token_cls=PunktToken, params=None):

overridden in nltk.tokenize.punkt.PunktSentenceTokenizer, nltk.tokenize.punkt.PunktTrainer

Undocumented

def _annotate_first_pass(self, tokens):

Perform the first pass of annotation, which makes decisions based purely on the word type of each word:

'?', '!', and '.' are marked as sentence breaks.

sequences of two or more periods are marked as an ellipsis.

any word ending in '.' that's a known abbreviation is marked as an abbreviation.

any other word ending in '.' is marked as a sentence break.

Return these annotations as a tuple of three sets:

sentbreak_toks: The indices of all sentence breaks.

abbrev_toks: The indices of all abbreviations.

ellipsis_toks: The indices of all ellipsis marks.
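The rules above can be sketched in a few lines of standalone Python. This is an illustration of the documented behavior only, not NLTK's implementation: the function name `annotate_first_pass`, the constants `SENT_END_CHARS` and `ELLIPSIS_RE`, and the `abbrev_types` argument are hypothetical, and matching abbreviations case-insensitively without the final period is an assumption.

```python
import re

# Hypothetical names for illustration; none of these are part of NLTK.
SENT_END_CHARS = {".", "?", "!"}
ELLIPSIS_RE = re.compile(r"\.\.+")  # two or more consecutive periods


def annotate_first_pass(tokens, abbrev_types):
    """Return (sentbreak_toks, abbrev_toks, ellipsis_toks) as sets of indices."""
    sentbreak_toks, abbrev_toks, ellipsis_toks = set(), set(), set()
    for i, tok in enumerate(tokens):
        if tok in SENT_END_CHARS:
            # A bare '?', '!' or '.' is a sentence break.
            sentbreak_toks.add(i)
        elif ELLIPSIS_RE.fullmatch(tok):
            ellipsis_toks.add(i)
        elif tok.endswith("."):
            # Assumption: abbreviations are stored lowercased, without the period.
            if tok[:-1].lower() in abbrev_types:
                abbrev_toks.add(i)
            else:
                sentbreak_toks.add(i)
    return sentbreak_toks, abbrev_toks, ellipsis_toks


tokens = ["Dr.", "Smith", "arrived", "...", "late", "."]
result = annotate_first_pass(tokens, abbrev_types={"dr"})
print(result)
```

Here "Dr." lands in the abbreviation set, "..." in the ellipsis set, and the final "." in the sentence-break set; tokens that match none of the rules are left unannotated.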

def _first_pass_annotation(self, aug_tok):

Performs type-based annotation on a single token.

def _tokenize_words(self, plaintext):

Divide the given text into tokens using the punkt word segmentation regular expression, and generate the resulting tokens as three-tuples: the token plus two boolean values indicating whether it occurs at the start of a paragraph or at the start of a new line, respectively.
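A minimal sketch of this kind of line-aware tokenization is shown below. It is not NLTK's code: `tokenize_words` and `WORD_RE` are hypothetical names, the whitespace-splitting pattern is only a stand-in for the punkt word segmentation regular expression, and the sketch assumes a paragraph start is flagged only for the first token after a blank line.

```python
import re

# Hypothetical stand-in for the punkt word segmentation regular expression.
WORD_RE = re.compile(r"\S+")


def tokenize_words(plaintext):
    """Yield (token, parastart, linestart) three-tuples."""
    parastart = False
    for line in plaintext.split("\n"):
        if not line.strip():
            # A blank line means the next token begins a new paragraph.
            parastart = True
            continue
        for i, tok in enumerate(WORD_RE.findall(line)):
            first = i == 0
            # Only the first token on a line can carry either flag.
            yield (tok, parastart and first, first)
        parastart = False


triples = list(tokenize_words("One two\nthree\n\nFour five"))
for triple in triples:
    print(triple)
```

Writing the function as a generator keeps memory use flat on long inputs, since each token is produced as soon as its line is scanned.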

_lang_vars

Undocumented

_params

overridden in nltk.tokenize.punkt.PunktSentenceTokenizer

The collection of parameters that determines the behavior of the punkt tokenizer.

_Token

The token class used to build tokens (the token_cls constructor argument; PunktToken by default).