Punkt Sentence Tokenizer

This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences. It must be trained on a large collection of plaintext in the target language before it can be used.

The NLTK data package includes a pre-trained Punkt tokenizer for English.

>>> import nltk.data
>>> text = '''
... Punkt knows that the periods in Mr. Smith and Johann S. Bach
... do not mark sentence boundaries.  And sometimes sentences
... can start with non-capitalized words.  i is a good variable
... name.
... '''
>>> sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
>>> print('\n-----\n'.join(sent_detector.tokenize(text.strip())))
Punkt knows that the periods in Mr. Smith and Johann S. Bach
do not mark sentence boundaries.
-----
And sometimes sentences
can start with non-capitalized words.
-----
i is a good variable
name.

(Note that whitespace from the original text, including newlines, is retained in the output.)

Punctuation following sentences is also included by default (from NLTK 3.0 onwards). It can be excluded by passing realign_boundaries=False.

>>> text = '''
... (How does it deal with this parenthesis?)  "It should be part of the
... previous sentence." "(And the same with this one.)" ('And this one!')
... "('(And (this)) '?)" [(and this. )]
... '''
>>> print('\n-----\n'.join(
...     sent_detector.tokenize(text.strip())))
(How does it deal with this parenthesis?)
-----
"It should be part of the
previous sentence."
-----
"(And the same with this one.)"
-----
('And this one!')
-----
"('(And (this)) '?)"
-----
[(and this. )]
>>> print('\n-----\n'.join(
...     sent_detector.tokenize(text.strip(), realign_boundaries=False)))
(How does it deal with this parenthesis?
-----
)  "It should be part of the
previous sentence.
-----
" "(And the same with this one.
-----
)" ('And this one!
-----
')
"('(And (this)) '?
-----
)" [(and this.
-----
)]

However, Punkt is designed to learn parameters (a list of abbreviations, etc.) unsupervised from a corpus similar to the target domain. The pre-packaged models may therefore be unsuitable: use PunktSentenceTokenizer(text) to learn parameters from the given text.
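
For example, to learn parameters from a domain corpus and then tokenize with them (a minimal sketch; train_text and document are placeholders for your own plaintext strings):

>>> from nltk.tokenize.punkt import PunktSentenceTokenizer
>>> tokenizer = PunktSentenceTokenizer(train_text)
>>> sentences = tokenizer.tokenize(document)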

PunktTrainer learns parameters, such as a list of abbreviations, from portions of text without supervision. Using a PunktTrainer directly allows for incremental training and modification of the hyperparameters used to decide what is considered an abbreviation, etc.
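
A sketch of using the trainer directly, training incrementally on two batches of text (corpus1 and corpus2 are placeholders; INCLUDE_ALL_COLLOCS is one of the trainer's tunable attributes):

>>> from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer
>>> trainer = PunktTrainer()
>>> trainer.INCLUDE_ALL_COLLOCS = True      # consider all collocation candidates
>>> trainer.train(corpus1, finalize=False)  # keep training open for more data
>>> trainer.train(corpus2, finalize=False)
>>> trainer.finalize_training()
>>> tokenizer = PunktSentenceTokenizer(trainer.get_params())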

The algorithm for this tokenizer is described in:

Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence
  Boundary Detection.  Computational Linguistics 32: 485-525.
Class PunktBaseClass Includes common components of PunktTrainer and PunktSentenceTokenizer.
Class PunktLanguageVars Stores variables, mostly regular expressions, which may be language-dependent for correct application of the algorithm. An extension of this class may modify its properties to suit a language other than English; an instance can then be passed as an argument to the PunktSentenceTokenizer and PunktTrainer constructors (see the sketch following this list).
Class PunktParameters Stores data used to perform sentence boundary detection with Punkt.
Class PunktSentenceTokenizer A sentence tokenizer which uses an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences; and then uses that model to find sentence boundaries. This approach has been shown to work well for many European languages.
Class PunktToken Stores a token of text with annotations produced during sentence boundary detection.
Class PunktTrainer Learns parameters used in Punkt sentence boundary detection.
Function demo Builds a Punkt model and applies it to the same text.
Function format_debug_decision Undocumented
Constant DEBUG_DECISION_FMT Undocumented
Constant REASON_ABBR_WITH_ORTHOGRAPHIC_HEURISTIC Undocumented
Constant REASON_ABBR_WITH_SENTENCE_STARTER Undocumented
Constant REASON_DEFAULT_DECISION Undocumented
Constant REASON_INITIAL_WITH_ORTHOGRAPHIC_HEURISTIC Undocumented
Constant REASON_INITIAL_WITH_SPECIAL_ORTHOGRAPHIC_HEURISTIC Undocumented
Constant REASON_KNOWN_COLLOCATION Undocumented
Constant REASON_NUMBER_WITH_ORTHOGRAPHIC_HEURISTIC Undocumented
Function _pair_iter Yields pairs of tokens from the given iterator such that each input token will appear as the first element in a yielded tuple. The last pair will have None as its second element.
Constant _ORTHO_BEG_LC Orthographic context: beginning of a sentence with lower case.
Constant _ORTHO_BEG_UC Orthographic context: beginning of a sentence with upper case.
Constant _ORTHO_LC Orthographic context: occurs with lower case.
Constant _ORTHO_MAP A map from context position and first-letter case to the appropriate orthographic context flag.
Constant _ORTHO_MID_LC Orthographic context: middle of a sentence with lower case.
Constant _ORTHO_MID_UC Orthographic context: middle of a sentence with upper case.
Constant _ORTHO_UC Orthographic context: occurs with upper case.
Constant _ORTHO_UNK_LC Orthographic context: unknown position in a sentence with lower case.
Constant _ORTHO_UNK_UC Orthographic context: unknown position in a sentence with upper case.
Variable _re_non_punct Matches token types that are not merely punctuation. (Types for numeric tokens are changed to ##number## and hence contain alpha.)
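
As an illustration of extending PunktLanguageVars, the subclass below also treats semicolons as sentence-final (a hedged sketch: sent_end_chars is the attribute the class exposes, while SemicolonLangVars is a name invented here):

>>> from nltk.tokenize.punkt import PunktLanguageVars, PunktSentenceTokenizer
>>> class SemicolonLangVars(PunktLanguageVars):
...     sent_end_chars = ('.', '?', '!', ';')
...
>>> tokenizer = PunktSentenceTokenizer(lang_vars=SemicolonLangVars())
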
def demo(text, tok_cls=PunktSentenceTokenizer, train_cls=PunktTrainer): (source)

Builds a Punkt model and applies it to the same text.
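
For instance (a sketch; text is a placeholder for a plaintext corpus):

>>> from nltk.tokenize.punkt import demo
>>> demo(text)  # trains on `text`, then prints the detected sentences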

def format_debug_decision(d): (source)

Undocumented

DEBUG_DECISION_FMT: str = (source)

Undocumented

Value
'''Text: %(text)r (at offset %(period_index)d)
Sentence break? %(break_decision)s (%(reason)s)
Collocation? %(collocation)s
%(type1)r:
    known abbreviation: %(type1_in_abbrs)s
    is initial: %(type1_is_initial)s
%(type2)r:
...
REASON_ABBR_WITH_ORTHOGRAPHIC_HEURISTIC: str = (source)

Undocumented

Value
'abbreviation + orthographic heuristic'
REASON_ABBR_WITH_SENTENCE_STARTER: str = (source)

Undocumented

Value
'abbreviation + frequent sentence starter'
REASON_DEFAULT_DECISION: str = (source)

Undocumented

Value
'default decision'
REASON_INITIAL_WITH_ORTHOGRAPHIC_HEURISTIC: str = (source)

Undocumented

Value
'initial + orthographic heuristic'
REASON_INITIAL_WITH_SPECIAL_ORTHOGRAPHIC_HEURISTIC: str = (source)

Undocumented

Value
'initial + special orthographic heuristic'
REASON_KNOWN_COLLOCATION: str = (source)

Undocumented

Value
'known collocation (both words)'
REASON_NUMBER_WITH_ORTHOGRAPHIC_HEURISTIC: str = (source)

Undocumented

Value
'initial + orthographic heuristic'
def _pair_iter(it): (source)

Yields pairs of tokens from the given iterator such that each input token will appear as the first element in a yielded tuple. The last pair will have None as its second element.
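
Concretely, the behavior this describes:

>>> list(_pair_iter(iter(['a', 'b', 'c'])))
[('a', 'b'), ('b', 'c'), ('c', None)]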

_ORTHO_BEG_LC = (source)

Orthographic context: beginning of a sentence with lower case.

Value
1 << 4
_ORTHO_BEG_UC = (source)

Orthographic context: beginning of a sentence with upper case.

Value
1 << 1
_ORTHO_LC = (source)

Orthographic context: occurs with lower case.

Value
_ORTHO_BEG_LC + _ORTHO_MID_LC + _ORTHO_UNK_LC
_ORTHO_MAP = (source)

A map from context position and first-letter case to the appropriate orthographic context flag.

Value
{('initial', 'upper'): _ORTHO_BEG_UC,
 ('internal', 'upper'): _ORTHO_MID_UC,
 ('unknown', 'upper'): _ORTHO_UNK_UC,
 ('initial', 'lower'): _ORTHO_BEG_LC,
 ('internal', 'lower'): _ORTHO_MID_LC,
 ('unknown', 'lower'): _ORTHO_UNK_LC}
_ORTHO_MID_LC = (source)

Orthographic context: middle of a sentence with lower case.

Value
1 << 5
_ORTHO_MID_UC = (source)

Orthographic context: middle of a sentence with upper case.

Value
1 << 2
_ORTHO_UC = (source)

Orthographic context: occurs with upper case.

Value
_ORTHO_BEG_UC + _ORTHO_MID_UC + _ORTHO_UNK_UC
_ORTHO_UNK_LC = (source)

Orthographic context: unknown position in a sentence with lower case.

Value
1 << 6
_ORTHO_UNK_UC = (source)

Orthographic context: unknown position in a sentence with upper case.

Value
1 << 3
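
Because each flag occupies its own bit, the contexts observed for a token type can be accumulated with bitwise OR and tested with bitwise AND (a sketch of the idea, not the module's exact code):

>>> _ORTHO_BEG_UC, _ORTHO_MID_UC, _ORTHO_MID_LC = 1 << 1, 1 << 2, 1 << 5
>>> context = _ORTHO_BEG_UC | _ORTHO_MID_LC   # seen in these two contexts
>>> bool(context & _ORTHO_BEG_UC)
True
>>> bool(context & _ORTHO_MID_UC)
False
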
_re_non_punct = (source)

Matches token types that are not merely punctuation. (Types for numeric tokens are changed to ##number## and hence contain alpha.)
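
A pattern with the behavior described above (a sketch; the spelling in the NLTK source may differ):

>>> import re
>>> non_punct = re.compile(r'[^\W\d]')    # a word character that is not a digit
>>> bool(non_punct.search('##number##'))  # contains alpha, so not mere punctuation
True
>>> bool(non_punct.search('...'))
False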