class documentation
class PunktLanguageVars(object):
Stores variables, mostly regular expressions, which may be language-dependent for correct application of the algorithm. A subclass may override these properties to suit a language other than English; an instance of it can then be passed as an argument to the PunktSentenceTokenizer and PunktTrainer constructors.
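As a minimal sketch of the subclassing pattern described above, assuming NLTK is installed: `sent_end_chars` and `period_context_re` are the full attribute and method names as they appear in NLTK's `nltk.tokenize.punkt` module (this page lists them only in shortened form), and `DandaLanguageVars` is a hypothetical subclass name chosen for illustration.

```python
from nltk.tokenize.punkt import PunktLanguageVars, PunktSentenceTokenizer


class DandaLanguageVars(PunktLanguageVars):
    """Hypothetical subclass: also treat the Devanagari danda as a boundary candidate."""

    # Assumption: overriding this class variable is enough to widen the
    # set of candidate sentence-ending characters.
    sent_end_chars = (".", "?", "!", "\u0964")


# Pass an instance of the subclass to the tokenizer constructor.
lang_vars = DandaLanguageVars()
tokenizer = PunktSentenceTokenizer(lang_vars=lang_vars)

# The compiled context pattern now also looks for the extra character.
pattern = lang_vars.period_context_re()
```

This only widens the set of candidate boundaries; whether a split is actually made still depends on the rest of the algorithm.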
Method | __getstate__ |
Undocumented |
Method | __setstate__ |
Undocumented |
Method | period |
Compiles and returns a regular expression to find contexts including possible sentence boundaries. |
Method | word |
Tokenizes a string to split off punctuation other than periods. |
Class Variable | __slots__ |
Undocumented |
Class Variable | internal |
Sentence-internal punctuation, which indicates an abbreviation if preceded by a period-final token. |
Class Variable | re |
Used to realign punctuation that should be included in a sentence although it follows the period (or ?, !). |
Class Variable | sent |
Characters that are candidates for sentence boundaries. |
Method | _word |
Compiles and returns a regular expression for word tokenization. |
Class Variable | _period |
Format string for a regular expression that finds contexts containing possible sentence boundaries. It matches the token that the possible sentence boundary ends, and matches the following token within a lookahead expression. |
Class Variable | _re |
Matches hyphens and ellipses, which are multi-character punctuation. |
Class Variable | _re |
Excludes some characters from starting word tokens. |
Class Variable | _word |
Format string for a regular expression that splits punctuation from words, excluding periods. |
Instance Variable | _re |
Undocumented |
Instance Variable | _re |
Undocumented |
Property | _re |
Undocumented |
Property | _re |
Undocumented |
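The word-tokenization entries above describe splitting punctuation from words while leaving periods attached and treating runs of dashes or dots as single tokens. A minimal standard-library sketch of that idea follows; `WORD_RE` and `split_words` are hypothetical names, and the pattern is a simplified stand-in, not the expression this class actually compiles.

```python
import re

# Assumed, simplified pattern (not the class's real one).
# Alternatives are tried in order at each position.
WORD_RE = re.compile(
    r"""
    (?:-{2,}|\.{2,})      # multi-character punctuation: dashes, ellipsis
    |
    [^\s,;:!?()"']+       # a word, keeping any trailing period attached
    |
    \S                    # any other single punctuation mark
    """,
    re.VERBOSE,
)


def split_words(text):
    """Return tokens with punctuation (other than periods) split off."""
    return WORD_RE.findall(text)


tokens = split_words("Hello, Mr. Smith -- wait ... (really)!")
print(tokens)
```

Keeping the period attached to `Mr.` matters because a later step has to decide whether a period-final token is an abbreviation or the end of a sentence.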
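To illustrate boundary realignment, here is a small standard-library sketch: when a naive split leaves a closing quote or bracket at the start of the next sentence, that punctuation is moved back to the sentence it belongs to. `REALIGN` and `realign` are hypothetical names, and the pattern is an assumption made for illustration rather than the value stored on the class.

```python
import re

# Assumed pattern: closing quotes or brackets at the start of a string,
# followed by whitespace or the end of the string.
REALIGN = re.compile(r'["\')\]}]+?(?:\s+|$)')


def realign(sentences):
    """Move leading closing punctuation of a sentence to the previous sentence."""
    result = []
    for sent in sentences:
        m = REALIGN.match(sent)
        if m and result:
            # Attach the stray punctuation to the preceding sentence.
            result[-1] += m.group(0).strip()
            sent = sent[m.end():]
        if sent:
            result.append(sent)
    return result


naive = ['He said "Stop.', '" Then he left.']
print(realign(naive))  # ['He said "Stop."', 'Then he left.']
```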
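The period-context format can be pictured as a template that is filled in with the sentence-ending characters and then compiled. The sketch below uses only the standard library; `PERIOD_CONTEXT_FMT`, `compile_period_context`, and the default `non_word` class are assumptions made for illustration, not the values defined on the class.

```python
import re

# Hypothetical format string: the placeholders are filled in before compiling.
PERIOD_CONTEXT_FMT = r"""
    \S*                          # the token that the candidate boundary ends
    %(SentEndChars)s             # a candidate sentence boundary
    (?=(?P<after_tok>
        %(NonWord)s              # either punctuation directly after it
        |
        \s+(?P<next_tok>\S+)     # or whitespace and the following token
    ))"""


def compile_period_context(sent_end_chars=(".", "?", "!"), non_word=r"[)\]}\"']"):
    """Fill in the template and compile it in verbose mode."""
    return re.compile(
        PERIOD_CONTEXT_FMT
        % {
            "SentEndChars": "[%s]" % re.escape("".join(sent_end_chars)),
            "NonWord": non_word,
        },
        re.VERBOSE,
    )


pattern = compile_period_context()
contexts = [
    (m.group(0), m.group("next_tok"))
    for m in pattern.finditer("Dr. Smith arrived. He sat down.")
]
print(contexts)  # [('Dr.', 'Smith'), ('arrived.', 'He')]
```

Both the abbreviation `Dr.` and the real boundary after `arrived.` show up as candidates; the pattern only finds contexts, and deciding which of them are actual sentence breaks is left to later steps. The final period has no following token, so the lookahead does not match it.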