class documentation

A sentence tokenizer which uses an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences; and then uses that model to find sentence boundaries. This approach has been shown to work well for many European languages.
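A minimal usage sketch, assuming training_text is a raw-text corpus you supply; the exact splits depend on what the model learns from it:

    from nltk.tokenize.punkt import PunktSentenceTokenizer

    # Learn abbreviations, collocations and frequent sentence starters from
    # raw text, then use that model to split unseen text into sentences.
    tokenizer = PunktSentenceTokenizer(training_text)
    print(tokenizer.tokenize("Dr. Watson came in. He looked tired."))
    # Expected, if "dr" was learned as an abbreviation:
    # ['Dr. Watson came in.', 'He looked tired.']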

Method __init__ train_text can either be the sole training text for this sentence boundary detector, or can be a PunktParameters object.
Method debug_decisions Classifies candidate periods as sentence breaks, yielding a dict for each that may be used to understand why the decision was made.
Method dump Undocumented
Method sentences_from_text Given a text, generates the sentences in that text by only testing candidate sentence breaks. If realign_boundaries is True, closing punctuation that follows the period (e.g. a closing parenthesis or quotation mark) is included in the same sentence.
Method sentences_from_text_legacy Given a text, generates the sentences in that text. Annotates all tokens, rather than just those with possible sentence breaks. Should produce the same results as sentences_from_text.
Method sentences_from_tokens Given a sequence of tokens, generates lists of tokens, each list corresponding to a sentence.
Method span_tokenize Given a text, generates (start, end) spans of sentences in the text.
Method text_contains_sentbreak Returns True if the given text includes a sentence break.
Method tokenize Given a text, returns a list of the sentences in that text.
Method train Derives parameters from a given training text, or uses the parameters given. Repeated calls to this method destroy previous parameters. For incremental training, instantiate a separate PunktTrainer instance.
Constant PUNCTUATION Undocumented
Method _annotate_second_pass Performs a token-based classification (section 4) over the given tokens, making use of the orthographic heuristic (4.1.1), collocation heuristic (4.1.2) and frequent sentence starter heuristic (4.1.3).
Method _annotate_tokens Given a set of tokens augmented with markers for line-start and paragraph-start, returns an iterator through those tokens with full annotation including predicted sentence breaks.
Method _build_sentence_list Given the original text and the list of augmented word tokens, construct and return a tokenized list of sentence strings.
Method _ortho_heuristic Decide whether the given token is the first token in a sentence.
Method _realign_boundaries Attempts to realign punctuation that falls after the period but should nonetheless be included in the same sentence.
Method _second_pass_annotation Performs token-based classification over a pair of contiguous tokens, updating the first.
Method _slices_from_text Undocumented
Instance Variable _params The collection of parameters (a PunktParameters instance) that determines the behavior of the punkt tokenizer.

Inherited from PunktBaseClass:

Method _annotate_first_pass Perform the first pass of annotation, which makes decisions based purely on the word type of each word.
Method _first_pass_annotation Performs type-based annotation on a single token.
Method _tokenize_words Divide the given text into tokens, using the punkt word segmentation regular expression, and generate the resulting list of tokens augmented as three-tuples with two boolean values for whether the given token occurs at the start of a paragraph or a new line, respectively.
Instance Variable _lang_vars Undocumented
Instance Variable _Token The token class used to construct augmented tokens (PunktToken by default).

Inherited from TokenizerI (via PunktBaseClass):

Method span_tokenize_sents Apply self.span_tokenize() to each element of strings, i.e. return [self.span_tokenize(s) for s in strings].
Method tokenize_sents Apply self.tokenize() to each element of strings, i.e. return [self.tokenize(s) for s in strings].
def __init__(self, train_text=None, verbose=False, lang_vars=None, token_cls=PunktToken): (source)

train_text can either be the sole training text for this sentence boundary detector, or can be a PunktParameters object.
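A sketch of the accepted forms, where raw_corpus and params are placeholders for a training string you supply and an existing PunktParameters instance, respectively:

    from nltk.tokenize.punkt import PunktSentenceTokenizer

    tok_a = PunktSentenceTokenizer(raw_corpus)  # train on the given text
    tok_b = PunktSentenceTokenizer(params)      # reuse pre-computed parameters
    tok_c = PunktSentenceTokenizer()            # no training; empty default parameters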

def debug_decisions(self, text): (source)

Classifies candidate periods as sentence breaks, yielding a dict for each that may be used to understand why the decision was made.

See format_debug_decision() to help make this output readable.
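A sketch of inspecting the decisions, assuming tokenizer is a trained PunktSentenceTokenizer; format_debug_decision is the module-level helper mentioned above:

    from nltk.tokenize.punkt import format_debug_decision

    # Each yielded dict records the period's context and the reasoning
    # behind treating it (or not) as a sentence break.
    for decision in tokenizer.debug_decisions("Mr. Smith arrived. He sat down."):
        print(format_debug_decision(decision))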

def dump(self, tokens): (source)

Undocumented

def sentences_from_text(self, text, realign_boundaries=True): (source)

Given a text, generates the sentences in that text by only testing candidate sentence breaks. If realign_boundaries is True, closing punctuation that follows the period (e.g. a closing parenthesis or quotation mark) is included in the same sentence.
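For example, a sketch assuming tokenizer was trained on text in which "Mrs." occurs as an abbreviation; the output shown is what one would expect in that case:

    text = "Mrs. Hudson knocked. Nobody answered."
    for sentence in tokenizer.sentences_from_text(text):
        print(sentence)
    # Expected:
    # Mrs. Hudson knocked.
    # Nobody answered.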

def sentences_from_text_legacy(self, text): (source)

Given a text, generates the sentences in that text. Annotates all tokens, rather than just those with possible sentence breaks. Should produce the same results as sentences_from_text.

def sentences_from_tokens(self, tokens): (source)

Given a sequence of tokens, generates lists of tokens, each list corresponding to a sentence.
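A sketch with an already word-tokenized input; the default, untrained tokenizer suffices here because the input contains no abbreviations, and the output shown is the expected grouping:

    from nltk.tokenize.punkt import PunktSentenceTokenizer

    tokenizer = PunktSentenceTokenizer()
    words = ["The", "cat", "sat", ".", "It", "purred", "."]
    for sent in tokenizer.sentences_from_tokens(words):
        print(sent)
    # Expected:
    # ['The', 'cat', 'sat', '.']
    # ['It', 'purred', '.']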

def span_tokenize(self, text, realign_boundaries=True): (source)

Given a text, generates (start, end) spans of sentences in the text.
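A sketch showing how the spans index into the original string; an untrained tokenizer suffices for this input, and the spans shown as comments are the expected result:

    from nltk.tokenize.punkt import PunktSentenceTokenizer

    tokenizer = PunktSentenceTokenizer()
    text = "First sentence. Second sentence."
    for start, end in tokenizer.span_tokenize(text):
        print((start, end), text[start:end])
    # Expected:
    # (0, 15) First sentence.
    # (16, 32) Second sentence.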

def text_contains_sentbreak(self, text): (source)

Returns True if the given text includes a sentence break.
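For example, with an untrained tokenizer and the default language variables (expected results shown as comments):

    from nltk.tokenize.punkt import PunktSentenceTokenizer

    tokenizer = PunktSentenceTokenizer()
    tokenizer.text_contains_sentbreak("no sentence break here")     # False
    tokenizer.text_contains_sentbreak("First part. Second part.")   # True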

def tokenize(self, text, realign_boundaries=True): (source)

Given a text, returns a list of the sentences in that text.
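A minimal sketch; the input contains no abbreviations, so even an untrained tokenizer is expected to split it correctly:

    from nltk.tokenize.punkt import PunktSentenceTokenizer

    tokenizer = PunktSentenceTokenizer()
    tokenizer.tokenize("The lights went out. Everyone waited.")
    # Expected: ['The lights went out.', 'Everyone waited.']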

def train(self, train_text, verbose=False): (source)

Derives parameters from a given training text, or uses the parameters given. Repeated calls to this method destroy previous parameters. For incremental training, instantiate a separate PunktTrainer instance.
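An incremental-training sketch along the lines suggested above; the corpus variables are placeholders for raw-text strings you supply:

    from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

    trainer = PunktTrainer()
    trainer.train(corpus_part_one)  # statistics accumulate
    trainer.train(corpus_part_two)  # ...across repeated calls
    tokenizer = PunktSentenceTokenizer(trainer.get_params())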

PUNCTUATION = (source)

Undocumented

Value
tuple(';:,.!?')

def _annotate_second_pass(self, tokens): (source)

Performs a token-based classification (section 4) over the given tokens, making use of the orthographic heuristic (4.1.1), collocation heuristic (4.1.2) and frequent sentence starter heuristic (4.1.3).

def _annotate_tokens(self, tokens): (source)

Given a set of tokens augmented with markers for line-start and paragraph-start, returns an iterator through those tokens with full annotation including predicted sentence breaks.

def _build_sentence_list(self, text, tokens): (source)

Given the original text and the list of augmented word tokens, construct and return a tokenized list of sentence strings.

def _ortho_heuristic(self, aug_tok): (source)

Decide whether the given token is the first token in a sentence.

def _realign_boundaries(self, text, slices): (source)

Attempts to realign punctuation that falls after the period but should nonetheless be included in the same sentence.

For example: "(Sent1.) Sent2." will otherwise be split as:

["(Sent1.", ") Sent1."].

This method will produce:

["(Sent1.)", "Sent2."].
def _second_pass_annotation(self, aug_tok1, aug_tok2): (source)

Performs token-based classification over a pair of contiguous tokens, updating the first.

def _slices_from_text(self, text): (source)

Undocumented