Punkt Sentence Tokenizer
This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences. It must be trained on a large collection of plaintext in the target language before it can be used.
The NLTK data package includes a pre-trained Punkt tokenizer for English.
>>> import nltk.data
>>> text = '''
... Punkt knows that the periods in Mr. Smith and Johann S. Bach
... do not mark sentence boundaries. And sometimes sentences
... can start with non-capitalized words. i is a good variable
... name.
... '''
>>> sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
>>> print('\n-----\n'.join(sent_detector.tokenize(text.strip())))
Punkt knows that the periods in Mr. Smith and Johann S. Bach
do not mark sentence boundaries.
-----
And sometimes sentences
can start with non-capitalized words.
-----
i is a good variable
name.
(Note that whitespace from the original text, including newlines, is retained in the output.)
Punctuation following sentences is also included by default (from NLTK 3.0 onwards). It can be excluded with the realign_boundaries flag.
>>> text = '''
... (How does it deal with this parenthesis?) "It should be part of the
... previous sentence." "(And the same with this one.)" ('And this one!')
... "('(And (this)) '?)" [(and this. )]
... '''
>>> print('\n-----\n'.join(
...     sent_detector.tokenize(text.strip())))
(How does it deal with this parenthesis?)
-----
"It should be part of the
previous sentence."
-----
"(And the same with this one.)"
-----
('And this one!')
-----
"('(And (this)) '?)"
-----
[(and this. )]
>>> print('\n-----\n'.join(
...     sent_detector.tokenize(text.strip(), realign_boundaries=False)))
(How does it deal with this parenthesis?
-----
) "It should be part of the
previous sentence.
-----
" "(And the same with this one.
-----
)" ('And this one!
-----
') "('(And (this)) '?
-----
)" [(and this.
-----
)]
However, Punkt is designed to learn parameters (a list of abbreviations, etc.) unsupervised from a corpus similar to the target domain. The pre-packaged models may therefore be unsuitable: use PunktSentenceTokenizer(text) to learn parameters from the given text.
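For example, a domain-specific tokenizer can be built by passing raw training text directly to the constructor. The following is a minimal sketch; the file name legal_corpus.txt and the sample sentence are placeholders, not part of NLTK:

>>> from nltk.tokenize.punkt import PunktSentenceTokenizer
>>> with open('legal_corpus.txt') as f:  # placeholder: any large plain-text file in the target domain
...     train_text = f.read()
>>> domain_tokenizer = PunktSentenceTokenizer(train_text)
>>> sentences = domain_tokenizer.tokenize('See 42 U.S.C. Sec. 1983. The claim was dismissed.')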
PunktTrainer learns parameters such as a list of abbreviations (without supervision) from portions of text. Using a PunktTrainer directly allows for incremental training and modification of the hyper-parameters used to decide what is considered an abbreviation, etc.
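A minimal sketch of that incremental workflow follows; batch_one and batch_two are placeholder names for strings of domain text, while INCLUDE_ALL_COLLOCS, train(), finalize_training(), and get_params() belong to the PunktTrainer API:

>>> from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer
>>> trainer = PunktTrainer()
>>> trainer.INCLUDE_ALL_COLLOCS = True  # hyper-parameter: treat all period-final word pairs as candidate collocations
>>> trainer.train(batch_one, finalize=False)  # batch_one/batch_two: placeholder text batches
>>> trainer.train(batch_two, finalize=False)
>>> trainer.finalize_training()  # freeze abbreviation/collocation decisions once all batches are seen
>>> tokenizer = PunktSentenceTokenizer(trainer.get_params())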
The algorithm for this tokenizer is described in:
Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence Boundary Detection. Computational Linguistics 32: 485-525.
Class | PunktBaseClass | Includes common components of PunktTrainer and PunktSentenceTokenizer.
Class | PunktLanguageVars | Stores variables, mostly regular expressions, which may be language-dependent for correct application of the algorithm. An extension of this class may modify its properties to suit a language other than English; an instance can then be passed as an argument to the PunktSentenceTokenizer and PunktTrainer constructors (see the sketch after this table).
Class | PunktParameters | Stores data used to perform sentence boundary detection with Punkt.
Class | PunktSentenceTokenizer | A sentence tokenizer which uses an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences; and then uses that model to find sentence boundaries. This approach has been shown to work well for many European languages.
Class | PunktToken | Stores a token of text with annotations produced during sentence boundary detection.
Class | PunktTrainer | Learns parameters used in Punkt sentence boundary detection.
Function | demo | Builds a punkt model and applies it to the same text.
Function | format_debug_decision | Undocumented
Constant | DEBUG_DECISION_FMT | Undocumented
Constant | REASON | Undocumented
Constant | REASON | Undocumented
Constant | REASON | Undocumented
Constant | REASON | Undocumented
Constant | REASON | Undocumented
Constant | REASON | Undocumented
Constant | REASON | Undocumented
Function | _pair_iter | Yields pairs of tokens from the given iterator such that each input token will appear as the first element in a yielded tuple. The last pair will have None as its second element.
Constant | _ORTHO_BEG_LC | Orthographic context: beginning of a sentence with lower case.
Constant | _ORTHO_BEG_UC | Orthographic context: beginning of a sentence with upper case.
Constant | _ORTHO_LC | Orthographic context: occurs with lower case.
Constant | _ORTHO_MAP | A map from context position and first-letter case to the appropriate orthographic context flag.
Constant | _ORTHO_MID_LC | Orthographic context: middle of a sentence with lower case.
Constant | _ORTHO_MID_UC | Orthographic context: middle of a sentence with upper case.
Constant | _ORTHO_UC | Orthographic context: occurs with upper case.
Constant | _ORTHO_UNK_LC | Orthographic context: unknown position in a sentence with lower case.
Constant | _ORTHO_UNK_UC | Orthographic context: unknown position in a sentence with upper case.
Variable | _re_non_punct | Matches token types that are not merely punctuation. (Types for numeric tokens are changed to ##number## and hence contain alpha.)
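As an illustration of the extension pattern described for PunktLanguageVars above, the sketch below subclasses it to add an extra sentence-ending character. The subclass name and the added character are invented for the example; sent_end_chars is a real attribute of PunktLanguageVars:

>>> from nltk.tokenize.punkt import PunktLanguageVars, PunktSentenceTokenizer
>>> class MyLanguageVars(PunktLanguageVars):  # hypothetical subclass
...     sent_end_chars = ('.', '?', '!', '\u203c')  # also treat U+203C (double exclamation mark) as a sentence ender
>>> tokenizer = PunktSentenceTokenizer(lang_vars=MyLanguageVars())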