class PunktSentenceTokenizer(PunktBaseClass, TokenizerI):
Constructor: PunktSentenceTokenizer(train_text, verbose, lang_vars, token_cls)
A sentence tokenizer which uses an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences; and then uses that model to find sentence boundaries. This approach has been shown to work well for many European languages.
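As a minimal sketch of how the model affects boundary detection, the snippet below compares an untrained tokenizer with one built from a hand-filled PunktParameters object, which the constructor accepts in place of training text. The abbreviation "dr" and the sample text are assumptions chosen for illustration and do not come from any shipped model.

```python
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

text = "Dr. Smith arrived. He sat down."

# Untrained tokenizer: no abbreviations are known yet.
untrained = PunktSentenceTokenizer()
untrained_sentences = untrained.tokenize(text)

# Hand-built parameters (illustrative): register "dr" as an abbreviation.
# Abbreviation types are stored lowercased and without the trailing period.
params = PunktParameters()
params.abbrev_types.add("dr")
tokenizer = PunktSentenceTokenizer(params)

sentences = tokenizer.tokenize(text)
spans = list(tokenizer.span_tokenize(text))  # (start, end) offsets into text

print(untrained_sentences)
print(sentences)
print(spans)
```

In practice the parameters would usually come from training on a representative corpus rather than being filled in by hand.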
Method | __init__ |
train_text can either be the sole training text for this sentence boundary detector, or can be a PunktParameters object. |
Method | debug |
Classifies candidate periods as sentence breaks, yielding a dict for each that may be used to understand why the decision was made. |
Method | dump |
Undocumented |
Method | sentences |
Given a text, generates the sentences in that text by only testing candidate sentence breaks. If realign_boundaries is True, includes in the sentence closing punctuation that follows the period. |
Method | sentences |
Given a text, generates the sentences in that text. Annotates all tokens, rather than just those with possible sentence breaks. Should produce the same results as sentences_from_text. |
Method | sentences |
Given a sequence of tokens, generates lists of tokens, each list corresponding to a sentence. |
Method | span |
Given a text, generates (start, end) spans of sentences in the text. |
Method | text |
Returns True if the given text includes a sentence break. |
Method | tokenize |
Given a text, returns a list of the sentences in that text. |
Method | train |
Derives parameters from a given training text, or uses the parameters given. Repeated calls to this method destroy previous parameters. For incremental training, instantiate a separate PunktTrainer instance. |
Constant | PUNCTUATION |
Undocumented |
Method | _annotate |
Performs a token-based classification (section 4) over the given tokens, making use of the orthographic heuristic (4.1.1), collocation heuristic (4.1.2) and frequent sentence starter heuristic (4.1.3). |
Method | _annotate |
Given a set of tokens augmented with markers for line-start and paragraph-start, returns an iterator through those tokens with full annotation including predicted sentence breaks. |
Method | _build |
Given the original text and the list of augmented word tokens, construct and return a tokenized list of sentence strings. |
Method | _ortho |
Decide whether the given token is the first token in a sentence. |
Method | _realign |
Attempts to realign punctuation that falls after the period but should otherwise be included in the same sentence. |
Method | _second |
Performs token-based classification over a pair of contiguous tokens updating the first. |
Method | _slices |
Undocumented |
Instance Variable | _params |
Undocumented |
Inherited from PunktBaseClass:
Method | _annotate |
Perform the first pass of annotation, which makes decisions based purely on the word type of each word. |
Method | _first |
Performs type-based annotation on a single token. |
Method | _tokenize |
Divide the given text into tokens, using the punkt word segmentation regular expression, and generate the resulting list of tokens augmented as three-tuples with two boolean values for whether the given token occurs at the start of a paragraph or a new line, respectively. |
Instance Variable | _lang |
Undocumented |
Instance Variable | _ |
The collection of parameters that determines the behavior of the punkt tokenizer. |
Inherited from TokenizerI (via PunktBaseClass):
Method | span |
Apply self.span_tokenize() to each element of strings. |
Method | tokenize |
Apply self.tokenize() to each element of strings. |
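The two batch methods inherited from TokenizerI can be sketched as follows. The method names tokenize_sents and span_tokenize_sents are taken from NLTK's TokenizerI, since the listing above shows them in shortened form, and the sample strings are assumptions for illustration. An untrained tokenizer is used so that no model data is needed.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

tokenizer = PunktSentenceTokenizer()  # untrained, default parameters
documents = ["Good morning. How are you?", "Fine."]

# Batch call: one list of sentences per input string.
batched = tokenizer.tokenize_sents(documents)

# Equivalent explicit loop over tokenize().
one_by_one = [tokenizer.tokenize(doc) for doc in documents]

# Batch span call: one list of (start, end) offsets per input string.
batched_spans = [list(spans) for spans in tokenizer.span_tokenize_sents(documents)]

print(batched)
print(batched_spans)
```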
Classifies candidate periods as sentence breaks, yielding a dict for each that may be used to understand why the decision was made.
See format_debug_decision() to help make this output readable.
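A minimal sketch of inspecting those decisions, assuming the method is debug_decisions (its name appears shortened in the listing above) and reusing a hand-built PunktParameters object with the illustrative abbreviation "dr":

```python
from nltk.tokenize.punkt import (
    PunktParameters,
    PunktSentenceTokenizer,
    format_debug_decision,
)

# Illustrative parameters: treat "dr" as a known abbreviation.
params = PunktParameters()
params.abbrev_types.add("dr")
tokenizer = PunktSentenceTokenizer(params)

text = "Dr. Smith arrived. He sat down."

# One dict per candidate break; the final period has no following token,
# so it is not a candidate.
decisions = list(tokenizer.debug_decisions(text))

for decision in decisions:
    # Render the decision dict as a human-readable report.
    print(format_debug_decision(decision))
```

Each dict carries a "break_decision" entry alongside the evidence that led to it, so the report shows why "Dr." was kept inside the sentence while "arrived." ended one.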
Overrides nltk.tokenize.api.TokenizerI.tokenize.
Given a text, returns a list of the sentences in that text.
Attempts to realign punctuation that falls after the period but should otherwise be included in the same sentence.
For example: "(Sent1.) Sent2." will otherwise be split as:
["(Sent1.", ") Sent2."].
This method will produce:
["(Sent1.)", "Sent2."].
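The realignment idea can be sketched in plain Python. This is a simplified illustration, not NLTK's implementation: the helper name realign and the regular expression are assumptions, and the sketch works on sentence strings rather than on slices of the original text.

```python
import re

# Closing punctuation at the start of a sentence, followed by whitespace
# or the end of the string (pattern chosen for this sketch).
_LEADING_CLOSERS = re.compile(r'["\')\]}]+(?:\s+|$)')


def realign(sentences):
    """Move leading closing punctuation of a sentence onto the previous one."""
    result = []
    for sentence in sentences:
        match = _LEADING_CLOSERS.match(sentence)
        if match and result:
            # Attach the closers to the previous sentence and drop them here.
            result[-1] += match.group(0).rstrip()
            sentence = sentence[match.end():]
        if sentence:
            result.append(sentence)
    return result


realigned = realign(["(Sent1.", ") Sent2."])
print(realigned)
```

With the tokenizer itself, this behaviour is controlled by the realign_boundaries argument described above.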