class documentation
class PunktBaseClass(object):
Known subclasses: nltk.tokenize.punkt.PunktSentenceTokenizer, nltk.tokenize.punkt.PunktTrainer
Constructor: PunktBaseClass(lang_vars, token_cls, params)
Includes common components of PunktTrainer and PunktSentenceTokenizer.
Method | __init__ | Undocumented |
Method | _annotate | Perform the first pass of annotation, which makes decisions based purely on the word type of each word. |
Method | _first | Performs type-based annotation on a single token. |
Method | _tokenize | Divide the given text into tokens using the punkt word segmentation regular expression, and generate the resulting tokens augmented as three-tuples, with two boolean values indicating whether the token occurs at the start of a paragraph or of a new line, respectively. |
Instance Variable | _lang | Undocumented |
Instance Variable | _params | Undocumented |
Instance Variable | _ | The collection of parameters that determines the behavior of the punkt tokenizer. |
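The three-tuple augmentation described for `_tokenize` above can be illustrated with a minimal sketch. This is not the library's implementation: the function name `tokenize_words` is hypothetical, and a plain `\S+` whitespace split stands in for the punkt word segmentation regular expression. It also assumes that a paragraph starts after one or more blank lines.

```python
import re


def tokenize_words(text):
    """Yield (token, parastart, linestart) three-tuples.

    Hypothetical sketch: `\\S+` stands in for the punkt word
    segmentation regular expression.
    """
    parastart = False  # set to True once a blank line has been seen
    for line in text.split("\n"):
        if line.strip():
            words = re.findall(r"\S+", line)
            for j, word in enumerate(words):
                first_on_line = j == 0
                # Only the first token on a line can start a line or paragraph.
                yield (word, parastart and first_on_line, first_on_line)
            parastart = False
        else:
            # A blank line means the next non-blank line opens a paragraph.
            parastart = True


if __name__ == "__main__":
    for item in tokenize_words("Hello world\nnext line\n\nNew para"):
        print(item)
```

In this sketch the first token of the text is flagged as a line start but not as a paragraph start, because no blank line precedes it.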
Perform the first pass of annotation, which makes decisions based purely on the word type of each word:
- '?', '!', and '.' are marked as sentence breaks.
- sequences of two or more periods are marked as ellipsis.
- any word ending in '.' that's a known abbreviation is marked as an abbreviation.
- any other word ending in '.' is marked as a sentence break.
Return these annotations as a tuple of three sets:
- sentbreak_toks: The indices of all sentence breaks.
- abbrev_toks: The indices of all abbreviations.
- ellipsis_toks: The indices of all ellipsis marks.
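The rules above can be sketched as a small standalone function. This is an illustration under stated assumptions, not the library's code: the name `annotate_first_pass` and the `KNOWN_ABBREVIATIONS` set are hypothetical, tokens are plain strings rather than token objects, and the abbreviation lookup is reduced to a lowercase match on the word without its trailing period.

```python
import re

# Hypothetical stand-in for the abbreviations held in the tokenizer's parameters.
KNOWN_ABBREVIATIONS = frozenset({"dr", "mr", "etc"})


def annotate_first_pass(tokens, abbreviations=KNOWN_ABBREVIATIONS):
    """Return (sentbreak_toks, abbrev_toks, ellipsis_toks) as sets of indices."""
    sentbreak_toks, abbrev_toks, ellipsis_toks = set(), set(), set()
    for i, tok in enumerate(tokens):
        if tok in ("?", "!", "."):
            # Bare sentence-final punctuation is a sentence break.
            sentbreak_toks.add(i)
        elif re.fullmatch(r"\.{2,}", tok):
            # Two or more periods form an ellipsis.
            ellipsis_toks.add(i)
        elif tok.endswith("."):
            if tok[:-1].lower() in abbreviations:
                # A known abbreviation keeps its period without breaking.
                abbrev_toks.add(i)
            else:
                # Any other word ending in '.' is a sentence break.
                sentbreak_toks.add(i)
    return sentbreak_toks, abbrev_toks, ellipsis_toks


if __name__ == "__main__":
    tokens = ["Dr.", "Smith", "arrived", "...", "He", "left."]
    print(annotate_first_pass(tokens))
```

With the sample tokens, index 0 (`Dr.`) lands in the abbreviation set, index 3 (`...`) in the ellipsis set, and index 5 (`left.`) in the sentence-break set.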