nltk.translate.gale_church

module documentation

A port of the Gale-Church Aligner.

Gale & Church (1993), A Program for Aligning Sentences in Bilingual Corpora. http://aclweb.org/anthology/J93-1004.pdf

Class	`LanguageIndependent`	Undocumented
Function	`align_blocks`	Return the sentence alignment of two text blocks (usually paragraphs).
Function	`align_log_prob`	Returns the log probability of the two sentences C{source_sents[i]}, C{target_sents[j]} being aligned with a specific C{alignment}.
Function	`align_texts`	Creates the sentence alignment of two texts.
Function	`erfcc`	Complementary error function.
Function	`norm_cdf`	Return the area under the normal distribution from M{-∞..x}.
Function	`parse_token_stream`	Parses a stream of tokens and splits it into sentences (using C{soft_delimiter} tokens) and blocks (using C{hard_delimiter} tokens) for use with the L{align_texts} function.
Function	`split_at`	Splits an iterator C{it} at values of C{split_value}.
Function	`trace`	Traverses the alignment cost from the tracebacks and retrieves the appropriate sentence pairs.
Constant	`LOG2`	Undocumented

def align_blocks(source_sents_lens, target_sents_lens, params=LanguageIndependent):

Return the sentence alignment of two text blocks (usually paragraphs).

>>> align_blocks([5,5,5], [7,7,7])
[(0, 0), (1, 1), (2, 2)]
>>> align_blocks([10,5,5], [12,20])
[(0, 0), (1, 1), (2, 1)]
>>> align_blocks([12,20], [10,5,5])
[(0, 0), (1, 1), (1, 2)]
>>> align_blocks([10,2,10,10,2,10], [12,3,20,3,12])
[(0, 0), (1, 1), (2, 2), (3, 2), (4, 3), (5, 4)]

@param source_sents_lens: The list of source sentence lengths.
@param target_sents_lens: The list of target sentence lengths.
@param params: The sentence alignment parameters.
@return: The sentence alignments, a list of index pairs.
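The returned index pairs can be awkward to read when one sentence is linked to several others. The following is a minimal sketch, not part of the module, that groups consecutive pairs sharing a source or target index into aligned groups; `group_links` is a hypothetical helper name introduced only for this illustration.

```python
def group_links(links):
    """Group (source_index, target_index) pairs into aligned groups.

    Hypothetical helper: consecutive pairs that share a source or a
    target index are merged into one (source_indices, target_indices) group.
    """
    groups = []
    for s, t in links:
        if groups and (s in groups[-1][0] or t in groups[-1][1]):
            src, tgt = groups[-1]
            if s not in src:
                src.append(s)
            if t not in tgt:
                tgt.append(t)
        else:
            groups.append(([s], [t]))
    return groups


# Index pairs copied from the second doctest above.
links = [(0, 0), (1, 1), (2, 1)]
groups = group_links(links)
# Source sentences 1 and 2 are both linked to target sentence 1.
```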

def align_log_prob(i, j, source_sents, target_sents, alignment, params):

Returns the log probability of the two sentences C{source_sents[i]}, C{target_sents[j]} being aligned with a specific C{alignment}.

@param i: The offset of the source sentence.
@param j: The offset of the target sentence.
@param source_sents: The list of source sentence lengths.
@param target_sents: The list of target sentence lengths.
@param alignment: The alignment type, a tuple of two integers.
@param params: The sentence alignment parameters.

@returns: The log probability of a specific alignment between the two sentences, given the parameters.
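To make the idea of a length-based alignment score concrete, here is a rough sketch in the spirit of Gale & Church (1993). It is not this function's code: it takes the already summed character lengths instead of offsets into the lists, the constants are assumed to be the values reported in the paper and are not confirmed by this page, and the denominator uses the mean of the two lengths purely as a choice that keeps the score defined when one side is empty. The name `length_cost_sketch` is hypothetical.

```python
import math

# Assumed constants (values reported by Gale & Church, 1993).
MEAN_RATIO = 1.0   # expected target characters per source character
VARIANCE = 6.8     # variance of the number of target characters per source character
PRIORS = {
    (1, 1): 0.89,
    (1, 0): 0.0099,
    (0, 1): 0.0099,
    (2, 1): 0.089,
    (1, 2): 0.089,
    (2, 2): 0.011,
}


def length_cost_sketch(len_source, len_target, alignment):
    """Return a cost: -log(two-sided normal tail probability * prior)."""
    mean_len = (len_source + len_target / MEAN_RATIO) / 2
    if mean_len == 0:
        return float("inf")
    # Standardized difference between the observed and expected target length.
    delta = (len_source * MEAN_RATIO - len_target) / math.sqrt(mean_len * VARIANCE)
    # erfc(|d| / sqrt(2)) equals 2 * (1 - Phi(|d|)) for the standard normal CDF Phi.
    tail = math.erfc(abs(delta) / math.sqrt(2))
    if tail == 0.0:
        return float("inf")
    return -(math.log(tail) + math.log(PRIORS[alignment]))


similar = length_cost_sketch(10, 10, (1, 1))
dissimilar = length_cost_sketch(10, 30, (1, 1))
```

In this sketch a lower value means a more plausible pairing, so `similar` is smaller than `dissimilar`.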

def align_texts(source_blocks, target_blocks, params=LanguageIndependent):

Creates the sentence alignment of two texts.

Texts can consist of several blocks. Block boundaries cannot be crossed by sentence alignment links.

Each block consists of a list that contains the lengths (in characters) of the sentences in this block.

@param source_blocks: The list of blocks in the source text.
@param target_blocks: The list of blocks in the target text.
@param params: The sentence alignment parameters.

@returns: A list of sentence alignment lists.
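The block structure described above can be illustrated with a small sketch that turns paragraphs of sentence strings into lists of character lengths. `to_blocks` and the sample sentences are hypothetical and only show the expected shape of `source_blocks` and `target_blocks`.

```python
def to_blocks(paragraphs):
    """Map each paragraph (a list of sentence strings) to a list of
    sentence lengths in characters."""
    return [[len(sentence) for sentence in paragraph] for paragraph in paragraphs]


# Two blocks: the first with two sentences, the second with one.
source_paragraphs = [["Hello there.", "How are you?"], ["Fine."]]
source_blocks = to_blocks(source_paragraphs)
# source_blocks is [[12, 12], [5]]
```

A target text would be prepared the same way, with one block per corresponding paragraph.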

def erfcc(x):

Complementary error function.

def norm_cdf(x):

Return the area under the normal distribution from M{-∞..x}.
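For reference, the relationship between the complementary error function and the normal CDF can be sketched with the standard library alone. This assumes the standard normal distribution (mean 0, variance 1) and uses `math.erfc` rather than this module's `erfcc`; `norm_cdf_sketch` is a hypothetical name.

```python
import math


def norm_cdf_sketch(x):
    """Area under the standard normal density from -infinity to x,
    written in terms of the complementary error function erfc = 1 - erf."""
    return 1.0 - 0.5 * math.erfc(x / math.sqrt(2.0))


half = norm_cdf_sketch(0.0)         # the distribution is symmetric around 0
upper_tail = 1.0 - norm_cdf_sketch(1.96)
```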

def parse_token_stream(stream, soft_delimiter, hard_delimiter):

Parses a stream of tokens and splits it into sentences (using C{soft_delimiter} tokens) and blocks (using C{hard_delimiter} tokens) for use with the L{align_texts} function.
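The splitting behaviour can be approximated by the standalone sketch below. It is an illustration under stated assumptions, not the module's implementation: a sentence's length is assumed to be the total number of characters in its tokens, empty sentences and empty blocks are dropped, and `parse_tokens_sketch`, the `"."` delimiter, and the `"<p>"` delimiter are all hypothetical.

```python
def parse_tokens_sketch(tokens, soft_delimiter, hard_delimiter):
    """Return a list of blocks, each a list of sentence lengths in characters."""
    blocks, block, length = [], [], 0
    for token in tokens:
        if token in (soft_delimiter, hard_delimiter):
            if length:  # close the sentence in progress
                block.append(length)
                length = 0
            if token == hard_delimiter and block:  # close the current block
                blocks.append(block)
                block = []
        else:
            length += len(token)
    # Flush whatever is left after the last delimiter.
    if length:
        block.append(length)
    if block:
        blocks.append(block)
    return blocks


tokens = ["Hello", "world", ".", "Bye", ".", "<p>", "New", "block", "."]
blocks = parse_tokens_sketch(tokens, ".", "<p>")
# blocks is [[10, 3], [8]]
```

The nested list has the same shape that `align_texts` expects for one side of the text pair.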

def split_at(it, split_value):

Splits an iterator C{it} at values of C{split_value}.

Each instance of C{split_value} is swallowed. The iterator produces subiterators which need to be consumed fully before the next subiterator can be used.
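The lazy splitting described above can be sketched as a generator of generators. This is an independent illustration, not the module's code, and `split_at_sketch` is a hypothetical name. Because every subiterator reads from the same underlying iterator, each one has to be exhausted before the next is requested.

```python
def split_at_sketch(iterable, split_value):
    """Yield subiterators over the items between occurrences of split_value."""
    it = iter(iterable)
    sentinel = object()  # marks exhaustion of the underlying iterator

    def chunk(first):
        value = first
        while value is not sentinel and value != split_value:
            yield value
            value = next(it, sentinel)  # the split value itself is swallowed

    while True:
        first = next(it, sentinel)
        if first is sentinel:
            return
        yield chunk(first)


# Consume each subiterator fully before moving on to the next one.
parts = [list(sub) for sub in split_at_sketch(["a", "b", ".", "c", ".", "d"], ".")]
# parts is [["a", "b"], ["c"], ["d"]]
```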

def trace(backlinks, source_sents_lens, target_sents_lens):

Traverses the alignment cost from the tracebacks and retrieves the appropriate sentence pairs.

Parameters
backlinks:dict	A dictionary whose keys are the alignment points and whose values are the costs (referencing LanguageIndependent.PRIORS)
source_sents_lens:list(int)	A list of source sentences' lengths
target_sents_lens:list(int)	A list of target sentences' lengths
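As a rough illustration of a traceback, the sketch below walks backwards from the end of both texts. It rests on an assumption this page does not confirm: each value in `backlinks` is taken to be an alignment type `(s, t)`, meaning that the last `s` source sentences and the last `t` target sentences before that point are aligned together. `trace_sketch` and the sample dictionary are hypothetical.

```python
def trace_sketch(backlinks, n_source, n_target):
    """Follow backlinks from (n_source, n_target) to (0, 0) and return
    the aligned (source_index, target_index) pairs in document order."""
    links = []
    position = (n_source, n_target)
    while position != (0, 0):
        s, t = backlinks[position]  # assumed alignment type, e.g. (2, 1)
        for i in range(s):
            for j in range(t):
                links.append((position[0] - i - 1, position[1] - j - 1))
        position = (position[0] - s, position[1] - t)
    return links[::-1]


# Hypothetical backlinks for three source and two target sentences:
# a 1-1 alignment followed by a 2-1 alignment.
backlinks = {(1, 1): (1, 1), (3, 2): (2, 1)}
pairs = trace_sketch(backlinks, 3, 2)
# pairs is [(0, 0), (1, 1), (2, 1)]
```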

LOG2 =

Undocumented

Value

math.log(2)

nltk.translate.gale_church