nltk.tokenize.punkt.PunktLanguageVars

class documentation

class PunktLanguageVars(object):

Stores variables, mostly regular expressions, which may be language-dependent for correct application of the algorithm. An extension of this class may modify its properties to suit a language other than English; an instance can then be passed as an argument to PunktSentenceTokenizer and PunktTrainer constructors.
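The extension pattern described above can be sketched without NLTK. In the sketch below, `LanguageVarsSketch` and `HindiVarsSketch` are hypothetical stand-in classes and `sent_end_pattern` is an assumed helper name; only the attribute name `sent_end_chars` comes from this page. It shows how overriding a class variable in a subclass could change a regular expression derived from it.

```python
import re


class LanguageVarsSketch:
    """Hypothetical stand-in for a language-variables class (not NLTK's code)."""

    # Candidate sentence-ending characters; a subclass may override this.
    sent_end_chars = (".", "?", "!")

    def sent_end_pattern(self):
        # Build a character class from whatever the (sub)class defines.
        return re.compile("[%s]" % re.escape("".join(self.sent_end_chars)))


class HindiVarsSketch(LanguageVarsSketch):
    # Also treat the Devanagari danda (U+0964) as a sentence-ending candidate.
    sent_end_chars = (".", "?", "!", "\u0964")


text = "first\u0964 second?"
default_hits = LanguageVarsSketch().sent_end_pattern().findall(text)
hindi_hits = HindiVarsSketch().sent_end_pattern().findall(text)
```

Here the default class finds only the question mark, while the subclass also finds the danda. With NLTK itself, an instance of a real PunktLanguageVars subclass would be passed to the tokenizer or trainer constructor, as described above.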

Method	`__getstate__`	Undocumented
Method	`__setstate__`	Undocumented
Method	`period_context_re`	Compiles and returns a regular expression to find contexts including possible sentence boundaries.
Method	`word_tokenize`	Tokenize a string to split off punctuation other than periods
Class Variable	`__slots__`	Undocumented
Class Variable	`internal_punctuation`	Sentence-internal punctuation, which indicates an abbreviation if preceded by a period-final token.
Class Variable	`re_boundary_realignment`	Used to realign punctuation that should be included in a sentence although it follows the period (or ?, !).
Class Variable	`sent_end_chars`	Characters which are candidates for sentence boundaries
Method	`_word_tokenizer_re`	Compiles and returns a regular expression for word tokenization
Class Variable	`_period_context_fmt`	Format of a regular expression to find contexts including possible sentence boundaries. Matches the token that the possible sentence boundary ends, and matches the following token within a lookahead expression.
Class Variable	`_re_multi_char_punct`	Hyphen and ellipsis are multi-character punctuation
Class Variable	`_re_word_start`	Excludes some characters from starting word tokens
Class Variable	`_word_tokenize_fmt`	Format of a regular expression to split punctuation from words, excluding period.
Instance Variable	`_re_period_context`	Undocumented
Instance Variable	`_re_word_tokenizer`	Undocumented
Property	`_re_non_word_chars`	Undocumented
Property	`_re_sent_end_chars`	Undocumented

def __getstate__(self):

Undocumented

def __setstate__(self, state):

Undocumented

def period_context_re(self):

Compiles and returns a regular expression to find contexts including possible sentence boundaries.

def word_tokenize(self, s):

Tokenize a string to split off punctuation other than periods

__slots__: tuple[str, ...]

Undocumented

internal_punctuation: str

Sentence-internal punctuation, which indicates an abbreviation if preceded by a period-final token.

re_boundary_realignment

Used to realign punctuation that should be included in a sentence although it follows the period (or ?, !).
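A minimal sketch of what such realignment might look like, using only the standard library. The pattern `REALIGN_SKETCH` and the helper `realign_pair` are assumptions made for illustration, not names or values taken from this page: the pattern matches closing quotes or brackets at the start of the text that follows a boundary, and the helper moves them back onto the preceding sentence.

```python
import re

# Assumed pattern: closing quotes/brackets followed by whitespace,
# a "--" dash, or the end of a line.
REALIGN_SKETCH = re.compile(r'["\')\]}]+?(?:\s+|(?=--)|$)', re.MULTILINE)


def realign_pair(first, second):
    """Move closing punctuation at the start of `second` onto the end of `first`."""
    match = REALIGN_SKETCH.match(second)
    if match is None:
        return first, second
    return first + match.group(0).strip(), second[match.end():]


# A naive split after the period leaves the closing quote on the wrong side.
fixed = realign_pair('She said "Go.', '" Then he left.')
```

After the call, the closing quote belongs to the first sentence and the second sentence starts at "Then".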

sent_end_chars: tuple[str, ...]

Characters which are candidates for sentence boundaries

def _word_tokenizer_re(self):

Compiles and returns a regular expression for word tokenization

_period_context_fmt: str

Format of a regular expression to find contexts including possible sentence boundaries. Matches the token that the possible sentence boundary ends, and matches the following token within a lookahead expression.
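To make the lookahead idea concrete, here is a simplified sketch of a period-context pattern. `PERIOD_CONTEXT_SKETCH`, its character classes, and the group names `after_tok` and `next_tok` are assumptions chosen for illustration and may differ from the library's actual format string. The token ending in a candidate character is consumed, while the following token is captured inside a lookahead and so is not consumed.

```python
import re

PERIOD_CONTEXT_SKETCH = re.compile(
    r"""
    \S*                        # the token that the candidate boundary ends
    [.?!]                      # a candidate sentence-ending character
    (?=(?P<after_tok>
        [)"';:\]}?!]           # either further punctuation
        |
        \s+(?P<next_tok>\S+)   # or whitespace and the following token
    ))
    """,
    re.VERBOSE,
)

text = "Mr. Smith left. He returned."
contexts = [
    (m.group(0), m.group("next_tok"))
    for m in PERIOD_CONTEXT_SKETCH.finditer(text)
]
```

Each pair holds a token ending in a candidate boundary and the token after it. The final "returned." yields no context in this sketch because nothing follows it.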

_re_multi_char_punct: str

Hyphen and ellipsis are multi-character punctuation
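As an illustration, a pattern along the following lines would treat runs of hyphens and periods as single punctuation tokens. The name `MULTI_CHAR_PUNCT_SKETCH` and the exact pattern are assumptions, since the value itself is not shown on this page.

```python
import re

# Assumed pattern: two or more hyphens, two or more periods,
# or periods separated by single whitespace characters (". . .").
MULTI_CHAR_PUNCT_SKETCH = re.compile(r"(?:-{2,}|\.{2,}|(?:\.\s){2,}\.)")

found = MULTI_CHAR_PUNCT_SKETCH.findall("Wait... no -- really . . . fine")
not_found = MULTI_CHAR_PUNCT_SKETCH.findall("well-known.")
```

A single hyphen or a single period does not match, so ordinary hyphenated words and sentence-final periods are left alone.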

_re_word_start: str

Excludes some characters from starting word tokens

_word_tokenize_fmt: str

Format of a regular expression to split punctuation from words, excluding period.
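The sketch below shows one way a format string of this kind could be filled in and compiled. The fragment values (`MULTI_CHAR`, `NON_WORD`, `WORD_START`), the placeholder names, and the template `WORD_TOKENIZE_FMT_SKETCH` are simplified assumptions for illustration, not the library's actual values. Punctuation is split from words, while a period stays attached to the preceding word.

```python
import re

# Hypothetical fragments, simplified for illustration.
MULTI_CHAR = r"(?:-{2,}|\.{2,})"     # runs of hyphens or periods
NON_WORD = r"[?!)\";}\]:({\[]"        # characters that end a word
WORD_START = r"[^(\"{\[:;,)}\]-]"     # characters that cannot start a word

WORD_TOKENIZE_FMT_SKETCH = r"""(
    %(MultiChar)s
    |
    (?=%(WordStart)s)\S+?                         # a word, consumed lazily until
    (?=\s|$|%(NonWord)s|%(MultiChar)s|,(?=$|\s))  # something marks its end
    |
    \S                                            # any other single character
)"""

# Substitute the fragments into the template, then compile in verbose mode.
pattern = re.compile(
    WORD_TOKENIZE_FMT_SKETCH
    % {"MultiChar": MULTI_CHAR, "NonWord": NON_WORD, "WordStart": WORD_START},
    re.VERBOSE,
)

tokens = pattern.findall("Wait--is it 3,000? Yes.")
```

Keeping the pattern as a template means a subclass can swap in one fragment, such as the word-start class, without rewriting the whole expression.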

_re_period_context

Undocumented

_re_word_tokenizer

Undocumented

@property
_re_non_word_chars

Undocumented

@property
_re_sent_end_chars

Undocumented