class documentation

class PunktLanguageVars(object):

Stores variables, mostly regular expressions, which may be language-dependent for correct application of the algorithm. An extension of this class may modify its properties to suit a language other than English; an instance can then be passed as an argument to PunktSentenceTokenizer and PunktTrainer constructors.
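The subclassing pattern described above might look like the following minimal sketch. `LanguageVarsSketch` and `ArabicVarsSketch` are hypothetical stand-ins written so the example runs without NLTK; they are not the library's classes, and the `sent_end_class` property only illustrates how an overridden class variable can flow into a derived regular expression.

```python
import re


class LanguageVarsSketch:
    """Hypothetical stand-in for PunktLanguageVars (not the NLTK class)."""

    # Candidate sentence-ending characters; subclasses may override this.
    sent_end_chars = (".", "?", "!")

    @property
    def sent_end_class(self):
        # Derive a regex character class from the current class variable.
        return "[%s]" % re.escape("".join(self.sent_end_chars))


class ArabicVarsSketch(LanguageVarsSketch):
    # Add the Arabic question mark (U+061F) to the candidates.
    sent_end_chars = (".", "?", "!", "\u061f")


english_hits = re.findall(LanguageVarsSketch().sent_end_class, "Ready? Go!")
arabic_hits = re.findall(ArabicVarsSketch().sent_end_class, "\u0646\u0639\u0645\u061f")
print(english_hits, arabic_hits)
```

Only the class variable changes in the subclass; everything derived from it picks up the new value. With NLTK itself, an instance of such a subclass would then be handed to the PunktSentenceTokenizer or PunktTrainer constructor, as the description states.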

Method __getstate__ Undocumented
Method __setstate__ Undocumented
Method period_context_re Compiles and returns a regular expression to find contexts including possible sentence boundaries.
Method word_tokenize Tokenizes a string, splitting off punctuation other than periods.
Class Variable __slots__ Undocumented
Class Variable internal_punctuation Sentence-internal punctuation, which indicates an abbreviation if preceded by a period-final token.
Class Variable re_boundary_realignment Used to realign punctuation that should be included in a sentence although it follows the period (or ?, !).
Class Variable sent_end_chars Characters which are candidates for sentence boundaries.
Method _word_tokenizer_re Compiles and returns a regular expression for word tokenization.
Class Variable _period_context_fmt Format of a regular expression to find contexts including possible sentence boundaries. Matches the token that the possible sentence boundary ends, and matches the following token within a lookahead expression.
Class Variable _re_multi_char_punct Hyphen and ellipsis are multi-character punctuation.
Class Variable _re_word_start Excludes some characters from starting word tokens.
Class Variable _word_tokenize_fmt Format of a regular expression to split punctuation from words, excluding period.
Instance Variable _re_period_context Undocumented
Instance Variable _re_word_tokenizer Undocumented
Property _re_non_word_chars Undocumented
Property _re_sent_end_chars Undocumented
def __getstate__(self):

Undocumented

def __setstate__(self, state):

Undocumented

def period_context_re(self):

Compiles and returns a regular expression to find contexts including possible sentence boundaries.
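A regular expression of this kind can be sketched as follows. The character classes `SENT_END_CHARS` and `NON_WORD_CHARS` are simplified assumptions, the template is modeled on the description of `_period_context_fmt` rather than copied from the library, and `find_period_contexts` is a hypothetical helper.

```python
import re

# Simplified, assumed character classes; the real ones come from the class.
SENT_END_CHARS = r"[.?!]"
NON_WORD_CHARS = r"[)\]};:?!]"

# Verbose-mode template, filled in with %-formatting before compiling.
PERIOD_CONTEXT_FMT = r"""
    \S*                          # the token that the candidate boundary ends
    %(SentEndChars)s             # a candidate sentence-ending character
    (?=(?P<after_tok>
        %(NonWord)s              # either further punctuation
        |
        \s+(?P<next_tok>\S+)     # or whitespace and the following token
    ))"""

PERIOD_CONTEXT_RE = re.compile(
    PERIOD_CONTEXT_FMT % {"SentEndChars": SENT_END_CHARS, "NonWord": NON_WORD_CHARS},
    re.VERBOSE,
)


def find_period_contexts(text):
    """Return (ending token, following token) pairs for candidate boundaries."""
    return [(m.group(0), m.group("next_tok")) for m in PERIOD_CONTEXT_RE.finditer(text)]


print(find_period_contexts("Dr. Smith arrived. He sat down."))
```

Because the following token is matched inside a lookahead, it is inspected without being consumed, so it can still begin the next match. In this sketch the final period is not reported, since nothing follows it.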

def word_tokenize(self, s):

Tokenizes a string, splitting off punctuation other than periods.
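The intended behavior can be illustrated with a deliberately simplified tokenizer. The pattern below is an assumption made for this sketch and is much cruder than the library's word-tokenization format; `word_tokenize_sketch` is a hypothetical name.

```python
import re

WORD_TOKENIZER_RE = re.compile(
    r"""
    [^\s,;:!?()"]+    # a word; a trailing period stays attached
    |
    \S                # any other single non-space character
    """,
    re.VERBOSE,
)


def word_tokenize_sketch(text):
    """Split off punctuation other than periods (simplified illustration)."""
    return WORD_TOKENIZER_RE.findall(text)


print(word_tokenize_sketch("Hello, Mr. Smith (again)!"))
```

Keeping the period attached to its token ("Mr.") lets a later step decide whether that period marks an abbreviation or a sentence boundary.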

__slots__: tuple[str, ...]

Undocumented

internal_punctuation: str

Sentence-internal punctuation, which indicates an abbreviation if preceded by a period-final token.
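The rule stated here can be written as a small check. The value of `INTERNAL_PUNCTUATION` is an assumption chosen for illustration, and `period_marks_abbreviation` is a hypothetical helper, not part of the class.

```python
# Assumed value for illustration; the real class variable may differ.
INTERNAL_PUNCTUATION = ",:;"


def period_marks_abbreviation(token, next_token):
    """True if a period-final token is followed by sentence-internal punctuation."""
    return (
        token.endswith(".")
        and bool(next_token)
        and next_token[0] in INTERNAL_PUNCTUATION
    )


print(period_marks_abbreviation("etc.", ","))
print(period_marks_abbreviation("done.", "Then"))
```

The reasoning is that a sentence rarely continues with a comma, colon, or semicolon right after a true sentence-final period, so such a period more plausibly belongs to an abbreviation.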

re_boundary_realignment

Used to realign punctuation that should be included in a sentence although it follows the period (or ?, !).
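A minimal sketch of such a realignment step follows. The pattern is an assumption modeled on the description (closing quotes and brackets, followed by whitespace, a dash, or end of line), and `realign` is a hypothetical helper operating on an already split list of sentences.

```python
import re

# Assumed pattern: closing quotes/brackets left over at the start of a sentence.
REALIGN_RE = re.compile(r"""["')\]}]+?(?:\s+|(?=--)|$)""", re.MULTILINE)


def realign(sentences):
    """Move leading closing punctuation back onto the previous sentence."""
    realigned = []
    for sent in sentences:
        match = REALIGN_RE.match(sent)
        if match and realigned:
            # Attach the stray punctuation to the sentence it belongs to.
            realigned[-1] += match.group(0).strip()
            sent = sent[match.end():]
        if sent:
            realigned.append(sent)
    return realigned


print(realign(['He said "Stop.', '" Then he left.']))
```

A naive split directly after the period strands the closing quotation mark at the start of the next sentence; the realignment moves it back.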

sent_end_chars: tuple[str, ...]

Characters which are candidates for sentence boundaries.

def _word_tokenizer_re(self):

Compiles and returns a regular expression for word tokenization.
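One plausible way to combine a format string, a `__slots__` entry, and a compile-on-demand method is shown below. Whether the library caches the compiled pattern in exactly this way is an assumption; `TokenizerVarsSketch` and the `Word` placeholder key are hypothetical.

```python
import re


class TokenizerVarsSketch:
    """Hypothetical stand-in showing compile-once caching with __slots__."""

    __slots__ = ("_re_word_tokenizer",)

    # Simplified format string; "Word" is a placeholder key for this sketch.
    _word_tokenize_fmt = r"%(Word)s|\S"

    def _word_tokenizer_re(self):
        try:
            # A slot that was never assigned raises AttributeError.
            return self._re_word_tokenizer
        except AttributeError:
            self._re_word_tokenizer = re.compile(
                self._word_tokenize_fmt % {"Word": r"\w+"}
            )
            return self._re_word_tokenizer


language_vars = TokenizerVarsSketch()
first = language_vars._word_tokenizer_re()
second = language_vars._word_tokenizer_re()
print(first is second)
```

Compiling lazily means a subclass can change the format string or its inputs before the first use, and the pattern is built only once per instance.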

_period_context_fmt: str

Format of a regular expression to find contexts including possible sentence boundaries. Matches the token that the possible sentence boundary ends, and matches the following token within a lookahead expression.

_re_multi_char_punct: str

Hyphen and ellipsis are multi-character punctuation.
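As an illustration, a pattern along these lines treats runs of hyphens and ellipses (including spaced ones) as single tokens. The exact expression is an assumption for this sketch, and `find_multi_char_punct` is a hypothetical helper.

```python
import re

# Assumed pattern: two or more hyphens, two or more periods, or a spaced ellipsis.
MULTI_CHAR_PUNCT_RE = re.compile(r"(?:-{2,}|\.{2,}|(?:\.\s){2,}\.)")


def find_multi_char_punct(text):
    """Return every multi-character punctuation run found in the text."""
    return MULTI_CHAR_PUNCT_RE.findall(text)


print(find_multi_char_punct("Wait... no -- really . . . yes"))
```

Matching these runs as a unit keeps an ellipsis from being read as several separate sentence-ending periods.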

_re_word_start: str

Excludes some characters from starting word tokens.

_word_tokenize_fmt: str

Format of a regular expression to split punctuation from words, excluding period.

_re_period_context

Undocumented

_re_word_tokenizer

Undocumented

@property
_re_non_word_chars

Undocumented

@property
_re_sent_end_chars

Undocumented