class documentation

A sentence tokenizer which uses an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences; and then uses that model to find sentence boundaries. This approach has been shown to work well for many European languages.
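A minimal usage sketch, assuming training_text is a raw-text corpus you supply; the exact splits depend on what the model learns from it:

    from nltk.tokenize.punkt import PunktSentenceTokenizer

    # Learn abbreviations, collocations and frequent sentence starters from
    # raw text, then use that model to split unseen text into sentences.
    tokenizer = PunktSentenceTokenizer(training_text)
    print(tokenizer.tokenize("Dr. Watson came in. He looked tired."))
    # Expected, if "dr" was learned as an abbreviation:
    # ['Dr. Watson came in.', 'He looked tired.']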

Method __init__ train_text can either be the sole training text for this sentence boundary detector, or can be a PunktParameters object.
Method debug_decisions Classifies candidate periods as sentence breaks, yielding a dict for each that may be used to understand why the decision was made.
Method dump Undocumented
Method sentences_from_text Given a text, generates the sentences in that text by only testing candidate sentence breaks. If realign_boundaries is True, closing punctuation that follows the period (e.g. a closing parenthesis or quotation mark) is included in the same sentence.
Method sentences_from_text_legacy Given a text, generates the sentences in that text. Annotates all tokens, rather than just those with possible sentence breaks. Should produce the same results as sentences_from_text.
Method sentences_from_tokens Given a sequence of tokens, generates lists of tokens, each list corresponding to a sentence.
Method span_tokenize Given a text, generates (start, end) spans of sentences in the text.
Method text_contains_sentbreak Returns True if the given text includes a sentence break.
Method tokenize Given a text, returns a list of the sentences in that text.
Method train Derives parameters from a given training text, or uses the parameters given. Repeated calls to this method destroy previous parameters. For incremental training, instantiate a separate PunktTrainer instance.
Constant PUNCTUATION Undocumented
Method _annotate_second_pass Performs a token-based classification (section 4) over the given tokens, making use of the orthographic heuristic (4.1.1), collocation heuristic (4.1.2) and frequent sentence starter heuristic (4.1.3).
Method _annotate_tokens Given a set of tokens augmented with markers for line-start and paragraph-start, returns an iterator through those tokens with full annotation including predicted sentence breaks.
Method _build_sentence_list Given the original text and the list of augmented word tokens, construct and return a tokenized list of sentence strings.
Method _ortho_heuristic Decide whether the given token is the first token in a sentence.
Method _realign_boundaries Attempts to realign punctuation that falls after the period but should nonetheless be included in the same sentence.
Method _second_pass_annotation Performs token-based classification over a pair of contiguous tokens, updating the first.
Method _slices_from_text Undocumented
Instance Variable _params The collection of parameters (a PunktParameters instance) that determines the behavior of the punkt tokenizer.

Inherited from PunktBaseClass:

Method _annotate_first_pass Perform the first pass of annotation, which makes decisions based purely on the word type of each word.
Method _first_pass_annotation Performs type-based annotation on a single token.
Method _tokenize_words Divide the given text into tokens, using the punkt word segmentation regular expression, and generate the resulting list of tokens augmented as three-tuples with two boolean values for whether the given token occurs at the start of a paragraph or a new line, respectively.
Instance Variable _lang_vars Undocumented
Instance Variable _Token The token class used to construct augmented tokens (PunktToken by default).

Inherited from TokenizerI (via PunktBaseClass):

Method span_tokenize_sents Apply self.span_tokenize() to each element of strings, i.e. return [self.span_tokenize(s) for s in strings].
Method tokenize_sents Apply self.tokenize() to each element of strings, i.e. return [self.tokenize(s) for s in strings].
def __init__(self, train_text=None, verbose=False, lang_vars=None, token_cls=PunktToken): (source)

train_text can either be the sole training text for this sentence boundary detector, or can be a PunktParameters object.
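A sketch of the accepted forms, where raw_corpus and params are placeholders for a training string you supply and an existing PunktParameters instance, respectively:

    from nltk.tokenize.punkt import PunktSentenceTokenizer

    tok_a = PunktSentenceTokenizer(raw_corpus)  # train on the given text
    tok_b = PunktSentenceTokenizer(params)      # reuse pre-computed parameters
    tok_c = PunktSentenceTokenizer()            # no training; empty default parameters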

def debug_decisions(self, text): (source)

Classifies candidate periods as sentence breaks, yielding a dict for each that may be used to understand why the decision was made.

See format_debug_decision() to help make this output readable.
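A sketch of inspecting the decisions, assuming tokenizer is a trained PunktSentenceTokenizer; format_debug_decision is the module-level helper mentioned above:

    from nltk.tokenize.punkt import format_debug_decision

    # Each yielded dict records the period's context and the reasoning
    # behind treating it (or not) as a sentence break.
    for decision in tokenizer.debug_decisions("Mr. Smith arrived. He sat down."):
        print(format_debug_decision(decision))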

def dump(self, tokens): (source)

Undocumented

def sentences_from_text(self, text, realign_boundaries=True): (source)

Given a text, generates the sentences in that text by only testing candidate sentence breaks. If realign_boundaries is True, closing punctuation that follows the period (e.g. a closing parenthesis or quotation mark) is included in the same sentence.
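For example, a sketch assuming tokenizer was trained on text in which "Mrs." occurs as an abbreviation; the output shown is what one would expect in that case:

    text = "Mrs. Hudson knocked. Nobody answered."
    for sentence in tokenizer.sentences_from_text(text):
        print(sentence)
    # Expected:
    # Mrs. Hudson knocked.
    # Nobody answered.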

def sentences_from_text_legacy(self, text): (source)

Given a text, generates the sentences in that text. Annotates all tokens, rather than just those with possible sentence breaks. Should produce the same results as sentences_from_text.

def sentences_from_tokens(self, tokens): (source)

Given a sequence of tokens, generates lists of tokens, each list corresponding to a sentence.
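A sketch with an already word-tokenized input; the default, untrained tokenizer suffices here because the input contains no abbreviations, and the output shown is the expected grouping:

    from nltk.tokenize.punkt import PunktSentenceTokenizer

    tokenizer = PunktSentenceTokenizer()
    words = ["The", "cat", "sat", ".", "It", "purred", "."]
    for sent in tokenizer.sentences_from_tokens(words):
        print(sent)
    # Expected:
    # ['The', 'cat', 'sat', '.']
    # ['It', 'purred', '.']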

def span_tokenize(self, text, realign_boundaries=True): (source)

Given a text, generates (start, end) spans of sentences in the text.
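A sketch showing how the spans index into the original string; an untrained tokenizer suffices for this input, and the spans shown as comments are the expected result:

    from nltk.tokenize.punkt import PunktSentenceTokenizer

    tokenizer = PunktSentenceTokenizer()
    text = "First sentence. Second sentence."
    for start, end in tokenizer.span_tokenize(text):
        print((start, end), text[start:end])
    # Expected:
    # (0, 15) First sentence.
    # (16, 32) Second sentence.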

def text_contains_sentbreak(self, text): (source)

Returns True if the given text includes a sentence break.
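For example, with an untrained tokenizer and the default language variables (expected results shown as comments):

    from nltk.tokenize.punkt import PunktSentenceTokenizer

    tokenizer = PunktSentenceTokenizer()
    tokenizer.text_contains_sentbreak("no sentence break here")     # False
    tokenizer.text_contains_sentbreak("First part. Second part.")   # True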

def tokenize(self, text, realign_boundaries=True): (source)

Given a text, returns a list of the sentences in that text.
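A minimal sketch; the input contains no abbreviations, so even an untrained tokenizer is expected to split it correctly:

    from nltk.tokenize.punkt import PunktSentenceTokenizer

    tokenizer = PunktSentenceTokenizer()
    tokenizer.tokenize("The lights went out. Everyone waited.")
    # Expected: ['The lights went out.', 'Everyone waited.']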

def train(self, train_text, verbose=False): (source)

Derives parameters from a given training text, or uses the parameters given. Repeated calls to this method destroy previous parameters. For incremental training, instantiate a separate PunktTrainer instance.
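An incremental-training sketch along the lines suggested above; the corpus variables are placeholders for raw-text strings you supply:

    from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

    trainer = PunktTrainer()
    trainer.train(corpus_part_one)  # statistics accumulate
    trainer.train(corpus_part_two)  # ...across repeated calls
    tokenizer = PunktSentenceTokenizer(trainer.get_params())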

PUNCTUATION = (source)

Undocumented

Value
tuple(';:,.!?')

def _annotate_second_pass(self, tokens): (source)

Performs a token-based classification (section 4) over the given tokens, making use of the orthographic heuristic (4.1.1), collocation heuristic (4.1.2) and frequent sentence starter heuristic (4.1.3).

def _annotate_tokens(self, tokens): (source)

Given a set of tokens augmented with markers for line-start and paragraph-start, returns an iterator through those tokens with full annotation including predicted sentence breaks.

def _build_sentence_list(self, text, tokens): (source)

Given the original text and the list of augmented word tokens, construct and return a tokenized list of sentence strings.

def _ortho_heuristic(self, aug_tok): (source)

Decide whether the given token is the first token in a sentence.

def _realign_boundaries(self, text, slices): (source)

Attempts to realign punctuation that falls after the period but should nonetheless be included in the same sentence.

For example: "(Sent1.) Sent2." will otherwise be split as:

["(Sent1.", ") Sent1."].

This method will produce:

["(Sent1.)", "Sent2."].
def _second_pass_annotation(self, aug_tok1, aug_tok2): (source)

Performs token-based classification over a pair of contiguous tokens, updating the first.

def _slices_from_text(self, text): (source)

Undocumented