Punkt Sentence Tokenizer

This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences. It must be trained on a large collection of plaintext in the target language before it can be used.

The NLTK data package includes a pre-trained Punkt tokenizer for English.

>>> import nltk.data
>>> text = '''
... Punkt knows that the periods in Mr. Smith and Johann S. Bach
... do not mark sentence boundaries.  And sometimes sentences
... can start with non-capitalized words.  i is a good variable
... name.
... '''
>>> sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
>>> print('\n-----\n'.join(sent_detector.tokenize(text.strip())))
Punkt knows that the periods in Mr. Smith and Johann S. Bach
do not mark sentence boundaries.
-----
And sometimes sentences
can start with non-capitalized words.
-----
i is a good variable
name.

(Note that whitespace from the original text, including newlines, is retained in the output.)

Punctuation following sentences is also included by default (from NLTK 3.0 onwards). It can be excluded by passing realign_boundaries=False.

>>> text = '''
... (How does it deal with this parenthesis?)  "It should be part of the
... previous sentence." "(And the same with this one.)" ('And this one!')
... "('(And (this)) '?)" [(and this. )]
... '''
>>> print('\n-----\n'.join(
...     sent_detector.tokenize(text.strip())))
(How does it deal with this parenthesis?)
-----
"It should be part of the
previous sentence."
-----
"(And the same with this one.)"
-----
('And this one!')
-----
"('(And (this)) '?)"
-----
[(and this. )]
>>> print('\n-----\n'.join(
...     sent_detector.tokenize(text.strip(), realign_boundaries=False)))
(How does it deal with this parenthesis?
-----
)  "It should be part of the
previous sentence.
-----
" "(And the same with this one.
-----
)" ('And this one!
-----
')
"('(And (this)) '?
-----
)" [(and this.
-----
)]

However, Punkt is designed to learn parameters (a list of abbreviations, etc.) unsupervised from a corpus similar to the target domain. The pre-packaged models may therefore be unsuitable: use PunktSentenceTokenizer(text) to learn parameters from the given text.
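
For example, to learn parameters from a domain corpus and then tokenize with them (a minimal sketch; train_text and document are placeholders for your own plaintext strings):

>>> from nltk.tokenize.punkt import PunktSentenceTokenizer
>>> tokenizer = PunktSentenceTokenizer(train_text)
>>> sentences = tokenizer.tokenize(document)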

PunktTrainer learns parameters, such as a list of abbreviations, from portions of text without supervision. Using a PunktTrainer directly allows for incremental training and modification of the hyperparameters used to decide what is considered an abbreviation, etc.
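
A sketch of using the trainer directly, training incrementally on two batches of text (corpus1 and corpus2 are placeholders; INCLUDE_ALL_COLLOCS is one of the trainer's tunable attributes):

>>> from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer
>>> trainer = PunktTrainer()
>>> trainer.INCLUDE_ALL_COLLOCS = True      # consider all collocation candidates
>>> trainer.train(corpus1, finalize=False)  # keep training open for more data
>>> trainer.train(corpus2, finalize=False)
>>> trainer.finalize_training()
>>> tokenizer = PunktSentenceTokenizer(trainer.get_params())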

The algorithm for this tokenizer is described in:

Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence
  Boundary Detection.  Computational Linguistics 32: 485-525.
Class PunktBaseClass Includes common components of PunktTrainer and PunktSentenceTokenizer.
Class PunktLanguageVars Stores variables, mostly regular expressions, which may be language-dependent for correct application of the algorithm. An extension of this class may modify its properties to suit a language other than English; an instance can then be passed as an argument to the PunktSentenceTokenizer and PunktTrainer constructors (see the sketch following this list).
Class PunktParameters Stores data used to perform sentence boundary detection with Punkt.
Class PunktSentenceTokenizer A sentence tokenizer which uses an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences; and then uses that model to find sentence boundaries. This approach has been shown to work well for many European languages.
Class PunktToken Stores a token of text with annotations produced during sentence boundary detection.
Class PunktTrainer Learns parameters used in Punkt sentence boundary detection.
Function demo Builds a Punkt model and applies it to the same text.
Function format_debug_decision Undocumented
Constant DEBUG_DECISION_FMT Undocumented
Constant REASON_ABBR_WITH_ORTHOGRAPHIC_HEURISTIC Undocumented
Constant REASON_ABBR_WITH_SENTENCE_STARTER Undocumented
Constant REASON_DEFAULT_DECISION Undocumented
Constant REASON_INITIAL_WITH_ORTHOGRAPHIC_HEURISTIC Undocumented
Constant REASON_INITIAL_WITH_SPECIAL_ORTHOGRAPHIC_HEURISTIC Undocumented
Constant REASON_KNOWN_COLLOCATION Undocumented
Constant REASON_NUMBER_WITH_ORTHOGRAPHIC_HEURISTIC Undocumented
Function _pair_iter Yields pairs of tokens from the given iterator such that each input token will appear as the first element in a yielded tuple. The last pair will have None as its second element.
Constant _ORTHO_BEG_LC Orthographic context: beginning of a sentence with lower case.
Constant _ORTHO_BEG_UC Orthographic context: beginning of a sentence with upper case.
Constant _ORTHO_LC Orthographic context: occurs with lower case.
Constant _ORTHO_MAP A map from context position and first-letter case to the appropriate orthographic context flag.
Constant _ORTHO_MID_LC Orthographic context: middle of a sentence with lower case.
Constant _ORTHO_MID_UC Orthographic context: middle of a sentence with upper case.
Constant _ORTHO_UC Orthographic context: occurs with upper case.
Constant _ORTHO_UNK_LC Orthographic context: unknown position in a sentence with lower case.
Constant _ORTHO_UNK_UC Orthographic context: unknown position in a sentence with upper case.
Variable _re_non_punct Matches token types that are not merely punctuation. (Types for numeric tokens are changed to ##number## and hence contain alpha.)
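
As an illustration of extending PunktLanguageVars, the subclass below also treats semicolons as sentence-final (a hedged sketch: sent_end_chars is the attribute the class exposes, while SemicolonLangVars is a name invented here):

>>> from nltk.tokenize.punkt import PunktLanguageVars, PunktSentenceTokenizer
>>> class SemicolonLangVars(PunktLanguageVars):
...     sent_end_chars = ('.', '?', '!', ';')
...
>>> tokenizer = PunktSentenceTokenizer(lang_vars=SemicolonLangVars())
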
def demo(text, tok_cls=PunktSentenceTokenizer, train_cls=PunktTrainer): (source)

Builds a Punkt model and applies it to the same text.
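
For instance (a sketch; text is a placeholder for a plaintext corpus):

>>> from nltk.tokenize.punkt import demo
>>> demo(text)  # trains on `text`, then prints the detected sentences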

def format_debug_decision(d): (source)

Undocumented

DEBUG_DECISION_FMT: str = (source)

Undocumented

Value
'''Text: %(text)r (at offset %(period_index)d)
Sentence break? %(break_decision)s (%(reason)s)
Collocation? %(collocation)s
%(type1)r:
    known abbreviation: %(type1_in_abbrs)s
    is initial: %(type1_is_initial)s
%(type2)r:
...
REASON_ABBR_WITH_ORTHOGRAPHIC_HEURISTIC: str = (source)

Undocumented

Value
'abbreviation + orthographic heuristic'
REASON_ABBR_WITH_SENTENCE_STARTER: str = (source)

Undocumented

Value
'abbreviation + frequent sentence starter'
REASON_DEFAULT_DECISION: str = (source)

Undocumented

Value
'default decision'
REASON_INITIAL_WITH_ORTHOGRAPHIC_HEURISTIC: str = (source)

Undocumented

Value
'initial + orthographic heuristic'
REASON_INITIAL_WITH_SPECIAL_ORTHOGRAPHIC_HEURISTIC: str = (source)

Undocumented

Value
'initial + special orthographic heuristic'
REASON_KNOWN_COLLOCATION: str = (source)

Undocumented

Value
'known collocation (both words)'
REASON_NUMBER_WITH_ORTHOGRAPHIC_HEURISTIC: str = (source)

Undocumented

Value
'initial + orthographic heuristic'
def _pair_iter(it): (source)

Yields pairs of tokens from the given iterator such that each input token will appear as the first element in a yielded tuple. The last pair will have None as its second element.
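
Concretely, the behavior this describes:

>>> list(_pair_iter(iter(['a', 'b', 'c'])))
[('a', 'b'), ('b', 'c'), ('c', None)]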

_ORTHO_BEG_LC = (source)

Orthographic context: beginning of a sentence with lower case.

Value
1 << 4
_ORTHO_BEG_UC = (source)

Orthographic context: beginning of a sentence with upper case.

Value
1 << 1
_ORTHO_LC = (source)

Orthographic context: occurs with lower case.

Value
_ORTHO_BEG_LC + _ORTHO_MID_LC + _ORTHO_UNK_LC
_ORTHO_MAP = (source)

A map from context position and first-letter case to the appropriate orthographic context flag.

Value
{('initial', 'upper'): _ORTHO_BEG_UC,
 ('internal', 'upper'): _ORTHO_MID_UC,
 ('unknown', 'upper'): _ORTHO_UNK_UC,
 ('initial', 'lower'): _ORTHO_BEG_LC,
 ('internal', 'lower'): _ORTHO_MID_LC,
 ('unknown', 'lower'): _ORTHO_UNK_LC}
_ORTHO_MID_LC = (source)

Orthographic context: middle of a sentence with lower case.

Value
1 << 5
_ORTHO_MID_UC = (source)

Orthographic context: middle of a sentence with upper case.

Value
1 << 2
_ORTHO_UC = (source)

Orthographic context: occurs with upper case.

Value
_ORTHO_BEG_UC + _ORTHO_MID_UC + _ORTHO_UNK_UC
_ORTHO_UNK_LC = (source)

Orthographic context: unknown position in a sentence with lower case.

Value
1 << 6
_ORTHO_UNK_UC = (source)

Orthographic context: unknown position in a sentence with upper case.

Value
1 << 3
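
Because each flag occupies its own bit, the contexts observed for a token type can be accumulated with bitwise OR and tested with bitwise AND (a sketch of the idea, not the module's exact code):

>>> _ORTHO_BEG_UC, _ORTHO_MID_UC, _ORTHO_MID_LC = 1 << 1, 1 << 2, 1 << 5
>>> context = _ORTHO_BEG_UC | _ORTHO_MID_LC   # seen in these two contexts
>>> bool(context & _ORTHO_BEG_UC)
True
>>> bool(context & _ORTHO_MID_UC)
False
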
_re_non_punct = (source)

Matches token types that are not merely punctuation. (Types for numeric tokens are changed to ##number## and hence contain alpha.)
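
A pattern with the behavior described above (a sketch; the spelling in the NLTK source may differ):

>>> import re
>>> non_punct = re.compile(r'[^\W\d]')    # a word character that is not a digit
>>> bool(non_punct.search('##number##'))  # contains alpha, so not mere punctuation
True
>>> bool(non_punct.search('...'))
False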