class TextTilingTokenizer(TokenizerI): (source)

Tokenize a document into topical sections using the TextTiling algorithm. This algorithm detects subtopic shifts based on the analysis of lexical co-occurrence patterns.

The process starts by tokenizing the text into pseudosentences of a fixed size w. Then, depending on the method used, similarity scores are assigned at pseudosentence gaps. The algorithm proceeds by detecting the gaps where similarity dips most deeply relative to the surrounding peaks (the depth score) and marking them as boundaries. The boundaries are then normalized to the closest paragraph break, and the segmented text is returned.

>>> from nltk.corpus import brown
>>> from nltk.tokenize import TextTilingTokenizer
>>> tt = TextTilingTokenizer(demo_mode=True)
>>> text = brown.raw()[:4000]
>>> s, ss, d, b = tt.tokenize(text)
>>> b
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0]
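
Here s, ss, d, and b are the gap scores, smoothed scores, depth scores, and per-gap boundary markers that demo mode exposes (e.g. for plotting). With demo_mode left at its default of False, tokenize() instead returns the segmented text itself; a minimal sketch, reusing text from above (no output shown, since the exact segmentation depends on the input):

>>> tt = TextTilingTokenizer()
>>> segments = tt.tokenize(text)   # list of strings, one per topical section
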
Parameters
    w: Pseudosentence size
    k: Size (in sentences) of the block used in the block comparison method
    similarity_method: The method used for determining similarity scores: BLOCK_COMPARISON (default) or VOCABULARY_INTRODUCTION
    stopwords: A list of stopwords that are filtered out (defaults to NLTK's stopwords corpus)
    smoothing_method: The method used for smoothing the score plot: DEFAULT_SMOOTHING (default)
    smoothing_width: The width of the window used by the smoothing method
    smoothing_rounds: The number of smoothing passes
    cutoff_policy: The policy used to determine the number of boundaries: HC (default) or LC
Method __init__ Initialize the tokenizer with the parameters described above
Method tokenize Return a tokenized copy of text, where each "token" represents a separate topic.
Method _block_comparison Implements the block comparison method
Method _create_token_table Creates a table of TokenTableFields
Method _depth_scores Calculates the depth of each gap, i.e. the average of the drops from the left and right peaks down to the gap's score
Method _divide_to_tokensequences Divides the text into pseudosentences of fixed size
Method _identify_boundaries Identifies boundaries at the peaks of similarity score differences
Method _mark_paragraph_breaks Identifies indented text or line breaks as the beginning of paragraphs
Method _normalize_boundaries Normalize the boundaries identified to the original text's paragraph breaks
Method _smooth_scores Wraps the smooth function from the SciPy Cookbook

Inherited from TokenizerI:

Method span_tokenize Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.
Method span_tokenize_sents Apply self.span_tokenize() to each element of strings, i.e. return [self.span_tokenize(s) for s in strings].
Method tokenize_sents Apply self.tokenize() to each element of strings, i.e. return [self.tokenize(s) for s in strings].
def __init__(self, w=20, k=10, similarity_method=BLOCK_COMPARISON, stopwords=None, smoothing_method=DEFAULT_SMOOTHING, smoothing_width=2, smoothing_rounds=1, cutoff_policy=HC, demo_mode=False): (source)

Initialize the tokenizer with the parameters described in the class documentation above.

def tokenize(self, text): (source)

Return a tokenized copy of text, where each "token" represents a separate topic.

def _block_comparison(self, tokseqs, token_table): (source)

Implements the block comparison method
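
The core idea, sketched below (a simplified illustration, not NLTK's exact code): the gap between two blocks of k pseudosentences is scored by a normalized inner product of the blocks' term-frequency vectors.

    import math
    from collections import Counter

    def block_score(block_a, block_b):
        # Each block is a list of pseudosentences (lists of filtered words).
        # Score = sum of joint term frequencies, normalized as in Hearst's
        # block comparison method.
        fa = Counter(w for seq in block_a for w in seq)
        fb = Counter(w for seq in block_b for w in seq)
        num = sum(fa[t] * fb[t] for t in fa.keys() & fb.keys())
        den = math.sqrt(sum(f * f for f in fa.values()) *
                        sum(f * f for f in fb.values()))
        return num / den if den else 0.0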

def _create_token_table(self, token_sequences, par_breaks): (source)

Creates a table of TokenTableFields

def _depth_scores(self, scores): (source)

Calculates the depth of each gap, i.e. the average difference between the left and right peaks and the gap's score
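
A hedged sketch of the idea: from each gap, walk outward while scores keep rising to find the left and right peaks, then combine the two drops (Hearst sums them; the averaging described above only rescales by a constant).

    def depth_score(scores, i):
        # Walk left, then right, while scores are non-decreasing.
        left = right = scores[i]
        for s in reversed(scores[:i]):
            if s < left:
                break
            left = s
        for s in scores[i + 1:]:
            if s < right:
                break
            right = s
        return (left - scores[i]) + (right - scores[i])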

def _divide_to_tokensequences(self, text): (source)

Divides the text into pseudosentences of fixed size
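
The division itself is plain fixed-size chunking of the word stream; a simplified sketch (the real method also records each token's character offset for later boundary normalization):

    import re

    def pseudosentences(text, w=20):
        # Chunks of w word tokens; the trailing chunk may be shorter.
        words = re.findall(r"\w+", text.lower())
        return [words[i:i + w] for i in range(0, len(words), w)]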

def _identify_boundaries(self, depth_scores): (source)

Identifies boundaries at the peaks of similarity score differences
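
A gap becomes a boundary when its depth score clears a cutoff computed from the score distribution: mean - stdev/2 under the HC policy, mean - stdev under LC. A sketch of the cutoff logic (NLTK additionally keeps accepted boundaries a minimum distance apart):

    import statistics

    def identify_boundaries(depth_scores, policy="HC"):
        avg = statistics.mean(depth_scores)
        std = statistics.stdev(depth_scores)
        cutoff = avg - std / 2.0 if policy == "HC" else avg - std
        # 1 marks a topic boundary at that gap, 0 no boundary.
        return [1 if d > cutoff else 0 for d in depth_scores]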

def _mark_paragraph_breaks(self, text): (source)

Identifies indented text or line breaks as the beginning of paragraphs
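
Roughly, paragraph starts are offset 0 plus every blank-line gap, skipping breaks that would yield very short paragraphs; a sketch of that approach:

    import re

    def mark_paragraph_breaks(text, min_paragraph=100):
        # Character offsets where paragraphs begin.
        blank_line = re.compile(r"[ \t\r\f\v]*\n[ \t\r\f\v]*\n[ \t\r\f\v]*")
        breaks, last = [0], 0
        for m in blank_line.finditer(text):
            if m.start() - last >= min_paragraph:
                breaks.append(m.start())
                last = m.start()
        return breaks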

def _normalize_boundaries(self, text, boundaries, paragraph_breaks): (source)

Normalize the boundaries identified to the original text's paragraph breaks
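
Each raw boundary is snapped to the closest paragraph break so that no segment splits a paragraph; a simplified sketch working directly on character offsets (the real method first converts gap indices to offsets):

    def normalize_boundaries(boundary_offsets, paragraph_breaks):
        # Snap each offset to the nearest paragraph break, dropping duplicates.
        normalized = []
        for off in boundary_offsets:
            best = min(paragraph_breaks, key=lambda pb: abs(pb - off))
            if best not in normalized:
                normalized.append(best)
        return normalized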

def _smooth_scores(self, gap_scores): (source)

Wraps the smooth function from the SciPy Cookbook
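
The Cookbook smooth() is essentially a moving average by convolution with a flat window, with the signal reflected at both ends to avoid edge artifacts; a minimal numpy sketch:

    import numpy as np

    def smooth(scores, window_len=3):
        # Reflect ends, convolve with a normalized flat window, trim back.
        x = np.asarray(scores, dtype=float)
        s = np.r_[x[window_len - 1:0:-1], x, x[-2:-window_len - 1:-1]]
        w = np.ones(window_len)
        y = np.convolve(w / w.sum(), s, mode="same")
        return y[window_len - 1:-window_len + 1]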