class documentation

Includes common components of PunktTrainer and PunktSentenceTokenizer.

Method __init__ Undocumented
Method _annotate_first_pass Perform the first pass of annotation, which makes decisions based purely on the word type of each word:
Method _first_pass_annotation Performs type-based annotation on a single token.
Method _tokenize_words Divide the given text into tokens, using the punkt word segmentation regular expression, and generate the resulting list of tokens augmented as three-tuples with two boolean values for whether the given token occurs at the start of a paragraph or a new line, respectively.
Instance Variable _lang_vars Undocumented
Instance Variable _params The collection of parameters that determines the behavior of the punkt tokenizer.
Instance Variable _Token The token class used to represent tokens (defaults to PunktToken).
def __init__(self, lang_vars=None, token_cls=PunktToken, params=None): (source)
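
The constructor itself is undocumented, but the signature suggests it falls back to fresh default objects when lang_vars or params are omitted. The following is a minimal sketch of that likely behavior, using empty stub classes as stand-ins for PunktLanguageVars, PunktParameters, and PunktToken (the real classes live in NLTK's punkt module), and assuming the enclosing class is NLTK's PunktBaseClass:

```python
class PunktLanguageVars:  # stub stand-in for illustration only
    pass

class PunktParameters:  # stub stand-in for illustration only
    pass

class PunktToken:  # stub stand-in for illustration only
    pass

class PunktBaseClass:
    """Sketch of the default handling implied by the signature."""

    def __init__(self, lang_vars=None, token_cls=PunktToken, params=None):
        # When no language variables or parameters are supplied,
        # fall back to freshly constructed defaults.
        if lang_vars is None:
            lang_vars = PunktLanguageVars()
        if params is None:
            params = PunktParameters()
        self._params = params
        self._lang_vars = lang_vars
        self._Token = token_cls
```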
def _annotate_first_pass(self, tokens): (source)

Perform the first pass of annotation, which makes decisions based purely on the word type of each word:

  • '?', '!', and '.' are marked as sentence breaks.
  • sequences of two or more periods are marked as ellipsis.
  • any word ending in '.' that's a known abbreviation is marked as an abbreviation.
  • any other word ending in '.' is marked as a sentence break.

Return these annotations as a tuple of three sets:

  • sentbreak_toks: The indices of all sentence breaks.
  • abbrev_toks: The indices of all abbreviations.
  • ellipsis_toks: The indices of all ellipsis marks.
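
The four rules and the three returned sets can be sketched as a standalone function. This is an illustrative reimplementation, not NLTK's actual code: the real method operates on augmented PunktToken objects, and the abbrevs argument here stands in for the trained abbreviation list held in the tokenizer's parameters:

```python
import re

def first_pass_annotate(tokens, abbrevs):
    """Classify token indices into sentence breaks, abbreviations,
    and ellipses, following the four first-pass rules."""
    sentbreak_toks, abbrev_toks, ellipsis_toks = set(), set(), set()
    for i, tok in enumerate(tokens):
        if tok in ("?", "!", "."):
            # Bare '?', '!', and '.' are sentence breaks.
            sentbreak_toks.add(i)
        elif re.fullmatch(r"\.\.+", tok):
            # Two or more periods form an ellipsis.
            ellipsis_toks.add(i)
        elif tok.endswith("."):
            if tok[:-1].lower() in abbrevs:
                # A known abbreviation ending in '.'.
                abbrev_toks.add(i)
            else:
                # Any other word ending in '.' is a sentence break.
                sentbreak_toks.add(i)
    return sentbreak_toks, abbrev_toks, ellipsis_toks
```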
def _first_pass_annotation(self, aug_tok): (source)

Performs type-based annotation on a single token.

def _tokenize_words(self, plaintext): (source)

Divide the given text into tokens, using the punkt word segmentation regular expression, and generate the resulting list of tokens augmented as three-tuples with two boolean values for whether the given token occurs at the start of a paragraph or a new line, respectively.
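
A sketch of that augmentation, using a simple whitespace-delimited \S+ pattern as a stand-in for the real punkt word segmentation regular expression (which is more elaborate): each yielded three-tuple carries the token plus the paragraph-start and line-start flags.

```python
import re

WORD_RE = re.compile(r"\S+")  # stand-in for punkt's word segmentation regex

def tokenize_words(plaintext):
    """Yield (token, parastart, linestart) three-tuples: the booleans
    mark whether the token starts a paragraph or a line, respectively."""
    parastart = False
    for line in plaintext.split("\n"):
        if not line.strip():
            # A blank line means the next token starts a new paragraph.
            parastart = True
            continue
        linestart = True
        for tok in WORD_RE.findall(line):
            yield (tok, parastart, linestart)
            parastart = False
            linestart = False
```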

_lang_vars = (source)

Undocumented

_params = (source)

The collection of parameters that determines the behavior of the punkt tokenizer.

_Token = (source)

The token class used to represent tokens (defaults to PunktToken).