class documentation

Learns parameters used in Punkt sentence boundary detection.

Method __init__ Undocumented
Method finalize_training Uses data that has been gathered in training to determine likely collocations and sentence starters.
Method find_abbrev_types Recalculates abbreviations from type frequencies, without relying on any prior determination of abbreviations. This does not include abbreviations that would otherwise be found as "rare".
Method freq_threshold Allows memory use to be reduced after much training by removing data about rare tokens that are unlikely to have a statistical effect with further training. Entries occurring at or above the given thresholds are retained.
Method get_params Calculates and returns parameters for sentence boundary detection as derived from training.
Method train Collects training data from a given text. If finalize is True, it will determine all the parameters for sentence boundary detection. If not, this will be delayed until get_params() or finalize_training() is called. If verbose is True, abbreviations found will be listed.
Method train_tokens Collects training data from a given list of tokens.
Constant ABBREV cut-off value for deciding whether a 'token' is an abbreviation
Constant ABBREV_BACKOFF upper cut-off for Mikheev's (2002) abbreviation detection algorithm
Constant COLLOCATION minimal log-likelihood value that two tokens need in order to be considered a collocation
Constant IGNORE_ABBREV_PENALTY allows disabling the abbreviation penalty heuristic, which exponentially disadvantages words that sometimes occur without a final period.
Constant INCLUDE_ABBREV_COLLOCS this includes as potential collocations all word pairs where the first word is an abbreviation. Such collocations override the orthographic heuristic, but not the sentence starter heuristic. This is overridden by INCLUDE_ALL_COLLOCS, and if both are false, only collocations with initials and ordinals are considered.
Constant INCLUDE_ALL_COLLOCS this includes as potential collocations all word pairs where the first word ends in a period. It may be useful in corpora where there is a lot of variation that makes abbreviations like Mr difficult to identify.
Constant MIN_COLLOC_FREQ this sets a minimum bound on the number of times a bigram needs to appear before it can be considered a collocation, in addition to log likelihood statistics. This is useful when INCLUDE_ALL_COLLOCS is True.
Constant SENT_STARTER minimal log-likelihood value that a token requires in order to be considered a frequent sentence starter
Static Method _col_log_likelihood Computes a log-likelihood estimate; described as algorithms 6 and 7 in the original paper.
Static Method _dunning_log_likelihood Calculates the modified Dunning log-likelihood ratio score for abbreviation candidates. The details of how this works are available in the paper.
Method _find_collocations Generates likely collocations and their log-likelihood.
Method _find_sent_starters Uses collocation heuristics for each candidate token to determine if it frequently starts sentences.
Method _freq_threshold Returns a FreqDist containing only data with counts at or above a given threshold, along with the number of removed entries recorded under the key None.
Method _get_orthography_data Collect information about whether each token type occurs with different case patterns (i) overall, (ii) at sentence-initial positions, and (iii) at sentence-internal positions.
Method _get_sentbreak_count Returns the number of sentence breaks marked in a given set of augmented tokens.
Method _is_potential_collocation Returns True if the pair of tokens may form a collocation given log-likelihood statistics.
Method _is_potential_sent_starter Returns True given a token and the token that precedes it if it seems clear that the token is beginning a sentence.
Method _is_rare_abbrev_type Determines whether a word type should be counted as a rare abbreviation.
Method _reclassify_abbrev_types (Re)classifies candidate token types as abbreviations or non-abbreviations, yielding (abbr, score, is_add) triples.
Method _train_tokens Undocumented
Method _unique_types Undocumented
Instance Variable _collocation_fdist A frequency distribution giving the frequency of all bigrams in the training data where the first word ends in a period. Bigrams are encoded as tuples of word types. Especially common collocations are extracted from this frequency distribution, and stored in _params.collocations.
Instance Variable _finalized A flag as to whether the training has been finalized by finding collocations and sentence starters, or whether finalize_training() still needs to be called.
Instance Variable _num_period_toks The number of words ending in period in the training data.
Instance Variable _sent_starter_fdist A frequency distribution giving the frequency of all words that occur in the training data at the beginning of a sentence (after the first pass of annotation). Especially common sentence starters are extracted from this frequency distribution, and stored in _params.sent_starters.
Instance Variable _sentbreak_count The total number of sentence breaks identified in training, used for calculating the frequent sentence starter heuristic.
Instance Variable _type_fdist A frequency distribution giving the frequency of each case-normalized token type in the training data.

Inherited from PunktBaseClass:

Method _annotate_first_pass Perform the first pass of annotation, which makes decisions based purely on the word type of each word.
Method _first_pass_annotation Performs type-based annotation on a single token.
Method _tokenize_words Divide the given text into tokens, using the punkt word segmentation regular expression, and generate the resulting list of tokens augmented as three-tuples with two boolean values for whether the given token occurs at the start of a paragraph or a new line, respectively.
Instance Variable _lang_vars Undocumented
Instance Variable _params The collection of parameters that determines the behavior of the punkt tokenizer.
Instance Variable _Token The token class used to construct tokens (the token_cls constructor argument; PunktToken by default).
def __init__(self, train_text=None, verbose=False, lang_vars=None, token_cls=PunktToken): (source)
def finalize_training(self, verbose=False): (source)

Uses data that has been gathered in training to determine likely collocations and sentence starters.

def find_abbrev_types(self): (source)

Recalculates abbreviations from type frequencies, without relying on any prior determination of abbreviations. This does not include abbreviations that would otherwise be found as "rare".

def freq_threshold(self, ortho_thresh=2, type_thresh=2, colloc_thres=2, sentstart_thresh=2): (source)

Allows memory use to be reduced after much training by removing data about rare tokens that are unlikely to have a statistical effect with further training. Entries occurring at or above the given thresholds are retained.
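
For example, a minimal sketch of incremental training with pruning between batches (the corpus strings are tiny stand-ins for real training text):

    from nltk.tokenize.punkt import PunktTrainer

    # Tiny stand-in corpora; real Punkt training needs far more text.
    text_part1 = "Dr. Smith arrived at 9 a.m. He was late."
    text_part2 = "Mrs. Jones left at noon. She took the 2 p.m. train."

    trainer = PunktTrainer()
    trainer.train(text_part1, finalize=False)  # gather counts only
    trainer.freq_threshold()                   # prune rare entries to save memory
    trainer.train(text_part2, finalize=False)
    trainer.finalize_training(verbose=True)    # now derive collocations and starters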

def get_params(self): (source)

Calculates and returns parameters for sentence boundary detection as derived from training.

def train(self, text, verbose=False, finalize=True): (source)

Collects training data from a given text. If finalize is True, it will determine all the parameters for sentence boundary detection. If not, this will be delayed until get_params() or finalize_training() is called. If verbose is True, abbreviations found will be listed.
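
A minimal end-to-end sketch (the one-line corpus is a stand-in; per the note above, get_params() triggers finalization automatically if it has not happened yet):

    from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

    corpus = "Mr. Brown met Dr. Gray. They talked for an hour."  # stand-in corpus

    trainer = PunktTrainer()
    trainer.train(corpus, verbose=True)  # finalize=True by default
    tokenizer = PunktSentenceTokenizer(trainer.get_params())
    print(tokenizer.tokenize("Mr. Brown left. Dr. Gray stayed."))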

def train_tokens(self, tokens, verbose=False, finalize=True): (source)

Collects training data from a given list of tokens.

ABBREV: float = (source)

cut-off value for deciding whether a 'token' is an abbreviation

Value
0.3
ABBREV_BACKOFF: int = (source)

upper cut-off for Mikheev's (2002) abbreviation detection algorithm

Value
5
COLLOCATION: float = (source)

minimal log-likelihood value that two tokens need in order to be considered a collocation

Value
7.88
IGNORE_ABBREV_PENALTY: bool = (source)

allows disabling the abbreviation penalty heuristic, which exponentially disadvantages words that sometimes occur without a final period.

Value
False
INCLUDE_ABBREV_COLLOCS: bool = (source)

this includes as potential collocations all word pairs where the first word is an abbreviation. Such collocations override the orthographic heuristic, but not the sentence starter heuristic. This is overridden by INCLUDE_ALL_COLLOCS, and if both are false, only collocations with initials and ordinals are considered.

Value
False
INCLUDE_ALL_COLLOCS: bool = (source)

this includes as potential collocations all word pairs where the first word ends in a period. It may be useful in corpora where there is a lot of variation that makes abbreviations like Mr difficult to identify.

Value
False
MIN_COLLOC_FREQ: int = (source)

this sets a minimum bound on the number of times a bigram needs to appear before it can be considered a collocation, in addition to log likelihood statistics. This is useful when INCLUDE_ALL_COLLOCS is True.

Value
1
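
These thresholds and flags are plain class attributes, so a common pattern is to override them on an instance before training; a minimal sketch with a stand-in corpus:

    from nltk.tokenize.punkt import PunktTrainer

    trainer = PunktTrainer()
    # Treat every period-final word pair as a collocation candidate, and
    # require a bigram to appear at least 5 times before it can qualify.
    trainer.INCLUDE_ALL_COLLOCS = True
    trainer.MIN_COLLOC_FREQ = 5
    trainer.train("Mr. Lee spoke. Mr. Lee left early.")  # stand-in corpus
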
SENT_STARTER: int = (source)

minimal log-likelihood value that a token requires in order to be considered a frequent sentence starter

Value
30
@staticmethod
def _col_log_likelihood(count_a, count_b, count_ab, N): (source)

Computes a log-likelihood estimate; described as algorithms 6 and 7 in the original paper.

This computes the original Dunning log-likelihood values, unlike _dunning_log_likelihood, which uses the modified Dunning log-likelihood values.
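
For illustration, a sketch of the standard Dunning log-likelihood ratio for a bigram (a, b); this shows the statistic itself rather than NLTK's exact implementation, and both function names are hypothetical:

    import math

    def _log_binomial(k, n, p):
        # Log-likelihood of k successes in n trials with success
        # probability p, omitting the constant binomial coefficient.
        if p <= 0.0:
            return 0.0 if k == 0 else float("-inf")
        if p >= 1.0:
            return 0.0 if k == n else float("-inf")
        return k * math.log(p) + (n - k) * math.log(1.0 - p)

    def col_log_likelihood(count_a, count_b, count_ab, n):
        # -2 log lambda comparing "b is independent of a" against
        # "P(b|a) and P(b|not a) differ"; assumes 0 < count_a < n.
        p = count_b / n                            # P(b) under independence
        p1 = count_ab / count_a                    # P(b | a)
        p2 = (count_b - count_ab) / (n - count_a)  # P(b | not a)
        ll_null = (_log_binomial(count_ab, count_a, p)
                   + _log_binomial(count_b - count_ab, n - count_a, p))
        ll_alt = (_log_binomial(count_ab, count_a, p1)
                  + _log_binomial(count_b - count_ab, n - count_a, p2))
        return -2.0 * (ll_null - ll_alt)

    # In 10,000 tokens: "Mr." occurs 40 times, "Smith" 50 times, and the
    # bigram "Mr. Smith" 30 times -- a strongly associated pair.
    print(col_log_likelihood(40, 50, 30, 10000))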

@staticmethod
def _dunning_log_likelihood(count_a, count_b, count_ab, N): (source)

Calculates the modified Dunning log-likelihood ratio score for abbreviation candidates. The details of how this works are available in the paper.

def _find_collocations(self): (source)

Generates likely collocations and their log-likelihood.

def _find_sent_starters(self): (source)

Uses collocation heuristics for each candidate token to determine if it frequently starts sentences.

def _freq_threshold(self, fdist, threshold): (source)

Returns a FreqDist containing only data with counts at or above a given threshold, along with the number of removed entries recorded under the key None.
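
A sketch of what this pruning could look like; the function name is hypothetical and this is not the exact NLTK source:

    from nltk.probability import FreqDist

    def freq_threshold_sketch(fdist, threshold):
        # Keep entries whose count is at or above the threshold and
        # record the number of removed entries under the key None.
        res = FreqDist()
        removed = 0
        for tok, count in fdist.items():
            if count >= threshold:
                res[tok] += count
            else:
                removed += 1
        res[None] += removed
        return res

    fd = FreqDist({"mr": 10, "xyz": 1, "dr": 4})
    print(dict(freq_threshold_sketch(fd, 2)))  # {'mr': 10, 'dr': 4, None: 1}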

def _get_orthography_data(self, tokens): (source)

Collect information about whether each token type occurs with different case patterns (i) overall, (ii) at sentence-initial positions, and (iii) at sentence-internal positions.

def _get_sentbreak_count(self, tokens): (source)

Returns the number of sentence breaks marked in a given set of augmented tokens.

def _is_potential_collocation(self, aug_tok1, aug_tok2): (source)

Returns True if the pair of tokens may form a collocation given log-likelihood statistics.

def _is_potential_sent_starter(self, cur_tok, prev_tok): (source)

Returns True given a token and the token that precedes it if it seems clear that the token is beginning a sentence.

def _is_rare_abbrev_type(self, cur_tok, next_tok): (source)

A word type is counted as a rare abbreviation if...
  • it's not already marked as an abbreviation
  • it occurs fewer than ABBREV_BACKOFF times
  • either it is followed by a sentence-internal punctuation mark, or it is followed by a lower-case word that sometimes appears with upper case, but never occurs with lower case at the beginning of sentences.

def _reclassify_abbrev_types(self, types): (source)

(Re)classifies each given token if
  • it is period-final and not a known abbreviation; or
  • it is not period-final and is otherwise a known abbreviation

by checking whether its previous classification still holds according to the heuristics of section 3. Yields triples (abbr, score, is_add) where abbr is the type in question, score is its log-likelihood with penalties applied, and is_add specifies whether the present type is a candidate for inclusion or exclusion as an abbreviation, such that:

  • (is_add and score >= 0.3) suggests a new abbreviation; and
  • (not is_add and score < 0.3) suggests excluding an abbreviation.
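
A hedged sketch of how these triples might be consumed, mirroring the two rules above; the helper name is hypothetical, and _reclassify_abbrev_types is private API:

    def apply_reclassification(trainer, params, types):
        # trainer: a PunktTrainer; params: a PunktParameters whose
        # abbrev_types set is updated in place.
        for abbr, score, is_add in trainer._reclassify_abbrev_types(types):
            if is_add and score >= trainer.ABBREV:
                params.abbrev_types.add(abbr)      # promote to abbreviation
            elif not is_add and score < trainer.ABBREV:
                params.abbrev_types.discard(abbr)  # drop the abbreviation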

def _train_tokens(self, tokens, verbose): (source)

Undocumented

def _unique_types(self, tokens): (source)

Undocumented

_collocation_fdist = (source)

A frequency distribution giving the frequency of all bigrams in the training data where the first word ends in a period. Bigrams are encoded as tuples of word types. Especially common collocations are extracted from this frequency distribution, and stored in _params.collocations (PunktParameters.collocations).

_finalized: bool = (source)

A flag as to whether the training has been finalized by finding collocations and sentence starters, or whether finalize_training() still needs to be called.

_num_period_toks: int = (source)

The number of words ending in period in the training data.

_sent_starter_fdist = (source)

A frequency distribution giving the frequency of all words that occur in the training data at the beginning of a sentence (after the first pass of annotation). Especially common sentence starters are extracted from this frequency distribution, and stored in _params.sent_starters.

_sentbreak_count: int = (source)

The total number of sentence breaks identified in training, used for calculating the frequent sentence starter heuristic.

_type_fdist = (source)

A frequency distribution giving the frequency of each case-normalized token type in the training data.