A port of the Gale-Church Aligner.
Gale & Church (1993), A Program for Aligning Sentences in Bilingual Corpora. http://aclweb.org/anthology/J93-1004.pdf
Class | LanguageIndependent |
Undocumented |
Function | align_blocks |
Return the sentence alignment of two text blocks (usually paragraphs). |
Function | align_log_prob |
Returns the log probability of the two sentences C{source_sents[i]}, C{target_sents[j]} being aligned with a specific C{alignment}. |
Function | align_texts |
Creates the sentence alignment of two texts. |
Function | erfcc |
Complementary error function. |
Function | norm |
Return the area under the normal distribution from M{-∞..x}. |
Function | parse |
Parses a stream of tokens and splits it into sentences (using C{soft_delimiter} tokens) and blocks (using C{hard_delimiter} tokens) for use with the L{align_texts} function. |
Function | split |
Splits an iterator C{it} at values of C{split_value}. |
Function | trace |
Traverses the alignment cost from the tracebacks and retrieves the appropriate sentence pairs. |
Constant | LOG2 |
Undocumented |
Return the sentence alignment of two text blocks (usually paragraphs).
>>> align_blocks([5,5,5], [7,7,7])
[(0, 0), (1, 1), (2, 2)]
>>> align_blocks([10,5,5], [12,20])
[(0, 0), (1, 1), (2, 1)]
>>> align_blocks([12,20], [10,5,5])
[(0, 0), (1, 1), (1, 2)]
>>> align_blocks([10,2,10,10,2,10], [12,3,20,3,12])
[(0, 0), (1, 1), (2, 2), (3, 2), (4, 3), (5, 4)]
@param source_sents_lens: The list of source sentence lengths.
@param target_sents_lens: The list of target sentence lengths.
@param params: The sentence alignment parameters.
@return: The sentence alignments, a list of index pairs.
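The block alignment above can be read as a shortest-path search over sentence positions. The following is only a minimal sketch of that dynamic program, not the module's implementation: `align_blocks_sketch`, `toy_cost` and the `PENALTY` table are hypothetical names, and the toy cost (length mismatch plus a fixed penalty per alignment type) stands in for the probabilistic cost of Gale & Church (1993), so results on other inputs may differ from the doctest above.

```python
import math

# Hypothetical penalties per alignment type (source count, target count).
# They stand in for the real cost model purely to keep the sketch short.
PENALTY = {(1, 1): 0.0, (1, 0): 5.0, (0, 1): 5.0,
           (2, 1): 2.0, (1, 2): 2.0, (2, 2): 4.0}


def toy_cost(source_len, target_len, kind):
    """Toy cost: length mismatch plus a fixed penalty for the alignment type."""
    return abs(source_len - target_len) + PENALTY[kind]


def align_blocks_sketch(source_lens, target_lens, cost=toy_cost):
    """Return index pairs for the cheapest alignment of two length lists."""
    n, m = len(source_lens), len(target_lens)
    best = {(0, 0): 0.0}  # cheapest cost of aligning the first i and j sentences
    back = {}             # alignment type used to reach each cell
    for i in range(n + 1):
        for j in range(m + 1):
            if (i, j) == (0, 0):
                continue
            best[(i, j)] = math.inf
            for s, t in PENALTY:
                pi, pj = i - s, j - t
                if pi < 0 or pj < 0:
                    continue  # this alignment type does not fit here
                candidate = best[(pi, pj)] + cost(
                    sum(source_lens[pi:i]), sum(target_lens[pj:j]), (s, t))
                if candidate < best[(i, j)]:
                    best[(i, j)] = candidate
                    back[(i, j)] = (s, t)
    # Walk the stored alignment types back from the end of both lists.
    links = []
    i, j = n, m
    while (i, j) != (0, 0):
        s, t = back[(i, j)]
        links.extend((a, b) for a in range(i - s, i) for b in range(j - t, j))
        i, j = i - s, j - t
    return sorted(links)


print(align_blocks_sketch([10, 5, 5], [12, 20]))  # [(0, 0), (1, 1), (2, 1)]
```

Passing the cost as a parameter keeps the search separate from the scoring model, so a probabilistic cost can be swapped in without touching the loop.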
Returns the log probability of the two sentences C{source_sents[i]}, C{target_sents[j]} being aligned with a specific C{alignment}.
@param i: The offset of the source sentence.
@param j: The offset of the target sentence.
@param source_sents: The list of source sentence lengths.
@param target_sents: The list of target sentence lengths.
@param alignment: The alignment type, a tuple of two integers.
@param params: The sentence alignment parameters.
@returns: The log probability of a specific alignment between the two sentences, given the parameters.
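As a rough illustration of how such a score can be computed, the sketch below follows the length-based model of Gale & Church (1993): the normalized length difference is assumed to be standard normal, and its two-sided tail probability is multiplied by a prior for the alignment type. The names `align_log_prob_sketch`, `AVERAGE_CHARACTERS`, `VARIANCE_CHARACTERS` and `PRIORS` are hypothetical here, the constants are the values reported in the paper, and giving each of the symmetric types (1-0 and 0-1, 2-1 and 1-2) the full reported probability is an assumption. The sketch takes the two summed lengths directly instead of offsets, and the module's own sign convention and numerics may differ.

```python
import math
import sys

# Values reported by Gale & Church (1993); the names are hypothetical.
AVERAGE_CHARACTERS = 1.0   # expected target characters per source character
VARIANCE_CHARACTERS = 6.8  # variance of that ratio
PRIORS = {(1, 1): 0.89, (1, 0): 0.0099, (0, 1): 0.0099,
          (2, 1): 0.089, (1, 2): 0.089, (2, 2): 0.011}


def align_log_prob_sketch(source_len, target_len, alignment):
    """Log probability that two spans of the given lengths are aligned."""
    mean = (source_len + target_len / AVERAGE_CHARACTERS) / 2
    if mean == 0:
        delta = 0.0  # two empty spans: no length evidence either way
    else:
        delta = ((source_len * AVERAGE_CHARACTERS - target_len)
                 / math.sqrt(mean * VARIANCE_CHARACTERS))
    # Two-sided tail of the standard normal: 2 * (1 - Phi(|delta|)).
    tail = math.erfc(abs(delta) / math.sqrt(2))
    tail = max(tail, sys.float_info.min)  # avoid log(0) for extreme mismatches
    return math.log(tail) + math.log(PRIORS[alignment])


print(align_log_prob_sketch(10, 10, (1, 1)))  # equals log(0.89): lengths match
print(align_log_prob_sketch(10, 30, (1, 1)))  # lower: lengths disagree
```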
Creates the sentence alignment of two texts.
Texts can consist of several blocks. Block boundaries cannot be crossed by sentence alignment links.
Each block consists of a list that contains the lengths (in characters) of the sentences in this block.
@param source_blocks: The list of blocks in the source text.
@param target_blocks: The list of blocks in the target text.
@param params: The sentence alignment parameters.
@returns: A list of sentence alignment lists
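Since alignment links cannot cross block boundaries, a text-level alignment can be pictured as one independent alignment per pair of blocks. The sketch below is written under that assumption: `align_texts_sketch` and `diagonal_aligner` are hypothetical names, the stand-in aligner simply pairs sentences in order, and raising `ValueError` on a block-count mismatch is an assumption, not documented behavior.

```python
def align_texts_sketch(source_blocks, target_blocks, align_block):
    """Align each source block with the target block at the same position."""
    if len(source_blocks) != len(target_blocks):
        # Assumption: texts with different block counts cannot be paired up.
        raise ValueError("source and target must have the same number of blocks")
    return [align_block(s, t) for s, t in zip(source_blocks, target_blocks)]


def diagonal_aligner(source_lens, target_lens):
    """Hypothetical stand-in aligner: pairs sentences 1-1 in order."""
    return [(i, i) for i in range(min(len(source_lens), len(target_lens)))]


# Each block is a list of sentence lengths in characters.
source_text = [[12, 40], [33]]
target_text = [[14, 38], [35]]
print(align_texts_sketch(source_text, target_text, diagonal_aligner))
# [[(0, 0), (1, 1)], [(0, 0)]]
```

Note that the index pairs in each inner list are relative to their own block, not to the whole text.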
Parses a stream of tokens and splits it into sentences (using C{soft_delimiter} tokens) and blocks (using C{hard_delimiter} tokens) for use with the L{align_texts} function.
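To make the expected output shape concrete, here is a minimal sketch that turns a token stream into blocks of sentence lengths. It is an eager, simplified stand-in and not the library's code: `parse_token_stream_sketch` is a hypothetical name, the delimiter tokens `"<s>"` and `"<p>"` are invented for the example, a sentence length is taken to be the summed character count of its tokens, and empty sentences or blocks are dropped (the real function may treat them differently).

```python
def parse_token_stream_sketch(stream, soft_delimiter, hard_delimiter):
    """Return a list of blocks, each a list of sentence lengths in characters."""
    blocks, block = [], []
    length, pending = 0, False  # running length of the sentence being read
    for token in stream:
        if token == soft_delimiter or token == hard_delimiter:
            if pending:  # close the current sentence
                block.append(length)
                length, pending = 0, False
            if token == hard_delimiter and block:  # close the current block
                blocks.append(block)
                block = []
        else:
            length += len(token)
            pending = True
    # Flush whatever is left when the stream ends without a delimiter.
    if pending:
        block.append(length)
    if block:
        blocks.append(block)
    return blocks


tokens = ["Hello", "world", ".", "<s>", "Bye", ".", "<s>", "<p>",
          "New", "block", "<s>"]
print(parse_token_stream_sketch(tokens, "<s>", "<p>"))  # [[11, 4], [8]]
```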
Splits an iterator C{it} at values of C{split_value}.
Each instance of C{split_value} is swallowed. The iterator produces subiterators which need to be consumed fully before the next subiterator can be used.
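The lazy splitting behavior described above can be sketched with nested generators. This is an illustrative reimplementation under the name `split_at_sketch` (hypothetical), not the module's code; it shares the same caveat that every subiterator has to be exhausted before the next one is requested, because all of them read from the same underlying iterator.

```python
def split_at_sketch(it, split_value):
    """Yield subiterators over `it`, split at (and dropping) `split_value`."""
    it = iter(it)
    sentinel = object()  # marks the end of the underlying iterator
    exhausted = False

    def chunk(first):
        nonlocal exhausted
        value = first
        while value is not sentinel and value != split_value:
            yield value
            value = next(it, sentinel)
        if value is sentinel:
            exhausted = True

    while not exhausted:
        first = next(it, sentinel)
        if first is sentinel:
            break
        yield chunk(first)


# Each chunk is consumed with list() before the next chunk is requested.
print([list(c) for c in split_at_sketch([1, 2, 0, 3, 0], 0)])  # [[1, 2], [3]]
```

Using `next(it, sentinel)` avoids letting `StopIteration` escape from inside a generator, which modern Python versions turn into a `RuntimeError`.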
Traverses the alignment cost from the tracebacks and retrieves the appropriate sentence pairs.
Parameters | |
backlinks:dict | A dictionary whose keys are the alignment points and whose values are the costs (referencing LanguageIndependent.PRIORS) |
source | A list of source sentences' lengths |
target | A list of target sentences' lengths |
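A traceback of this kind might look like the sketch below. It assumes, and this is an assumption not stated in the table above, that each backlink maps an end position `(i, j)` to the alignment type `(s, t)` used to reach it, meaning the last `s` source sentences were linked to the last `t` target sentences. `trace_sketch` is a hypothetical name.

```python
def trace_sketch(backlinks, source_lens, target_lens):
    """Follow backlinks from the end of both texts and collect index pairs."""
    links = []
    i, j = len(source_lens), len(target_lens)
    while (i, j) != (0, 0):
        s, t = backlinks[(i, j)]  # alignment type that ends at this position
        for a in range(i - s, i):
            for b in range(j - t, j):
                links.append((a, b))
        i, j = i - s, j - t  # jump to the position before this alignment
    return sorted(links)


# One 1-1 alignment followed by one 2-1 alignment.
backlinks = {(1, 1): (1, 1), (3, 2): (2, 1)}
print(trace_sketch(backlinks, [10, 5, 5], [12, 20]))
# [(0, 0), (1, 1), (2, 1)]
```

An alignment of type `(1, 0)` or `(0, 1)` contributes no pairs in this sketch, since one of the two ranges is empty.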