class TextTilingTokenizer(TokenizerI): (source)

Tokenize a document into topical sections using the TextTiling algorithm. This algorithm detects subtopic shifts based on the analysis of lexical co-occurrence patterns.

The process starts by tokenizing the text into pseudosentences of a fixed size w. Then, depending on the method used, similarity scores are assigned at pseudosentence gaps. The algorithm proceeds by detecting the gaps where similarity dips most deeply relative to the surrounding peaks (the depth score) and marking them as boundaries. The boundaries are then normalized to the closest paragraph break, and the segmented text is returned.

>>> from nltk.corpus import brown
>>> from nltk.tokenize import TextTilingTokenizer
>>> tt = TextTilingTokenizer(demo_mode=True)
>>> text = brown.raw()[:4000]
>>> s, ss, d, b = tt.tokenize(text)
>>> b
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0]
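
Here s, ss, d, and b are the gap scores, smoothed scores, depth scores, and per-gap boundary markers that demo mode exposes (e.g. for plotting). With demo_mode left at its default of False, tokenize() instead returns the segmented text itself; a minimal sketch, reusing text from above (no output shown, since the exact segmentation depends on the input):

>>> tt = TextTilingTokenizer()
>>> segments = tt.tokenize(text)   # list of strings, one per topical section
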
Parameters
    w: Pseudosentence size
    k: Size (in sentences) of the block used in the block comparison method
    similarity_method: The method used for determining similarity scores: BLOCK_COMPARISON (default) or VOCABULARY_INTRODUCTION
    stopwords: A list of stopwords that are filtered out (defaults to NLTK's stopwords corpus)
    smoothing_method: The method used for smoothing the score plot: DEFAULT_SMOOTHING (default)
    smoothing_width: The width of the window used by the smoothing method
    smoothing_rounds: The number of smoothing passes
    cutoff_policy: The policy used to determine the number of boundaries: HC (default) or LC
Method __init__ Initialize the tokenizer with the parameters described above
Method tokenize Return a tokenized copy of text, where each "token" represents a separate topic.
Method _block_comparison Implements the block comparison method
Method _create_token_table Creates a table of TokenTableFields
Method _depth_scores Calculates the depth of each gap, i.e. the average of the drops from the left and right peaks down to the gap's score
Method _divide_to_tokensequences Divides the text into pseudosentences of fixed size
Method _identify_boundaries Identifies boundaries at the peaks of similarity score differences
Method _mark_paragraph_breaks Identifies indented text or line breaks as the beginning of paragraphs
Method _normalize_boundaries Normalize the boundaries identified to the original text's paragraph breaks
Method _smooth_scores Wraps the smooth function from the SciPy Cookbook

Inherited from TokenizerI:

Method span_tokenize Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.
Method span_tokenize_sents Apply self.span_tokenize() to each element of strings, i.e. return [self.span_tokenize(s) for s in strings].
Method tokenize_sents Apply self.tokenize() to each element of strings, i.e. return [self.tokenize(s) for s in strings].
def __init__(self, w=20, k=10, similarity_method=BLOCK_COMPARISON, stopwords=None, smoothing_method=DEFAULT_SMOOTHING, smoothing_width=2, smoothing_rounds=1, cutoff_policy=HC, demo_mode=False): (source)

Initialize the tokenizer with the parameters described in the class documentation above.

def tokenize(self, text): (source)

Return a tokenized copy of text, where each "token" represents a separate topic.

def _block_comparison(self, tokseqs, token_table): (source)

Implements the block comparison method
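
The core idea, sketched below (a simplified illustration, not NLTK's exact code): the gap between two blocks of k pseudosentences is scored by a normalized inner product of the blocks' term-frequency vectors.

    import math
    from collections import Counter

    def block_score(block_a, block_b):
        # Each block is a list of pseudosentences (lists of filtered words).
        # Score = sum of joint term frequencies, normalized as in Hearst's
        # block comparison method.
        fa = Counter(w for seq in block_a for w in seq)
        fb = Counter(w for seq in block_b for w in seq)
        num = sum(fa[t] * fb[t] for t in fa.keys() & fb.keys())
        den = math.sqrt(sum(f * f for f in fa.values()) *
                        sum(f * f for f in fb.values()))
        return num / den if den else 0.0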

def _create_token_table(self, token_sequences, par_breaks): (source)

Creates a table of TokenTableFields

def _depth_scores(self, scores): (source)

Calculates the depth of each gap, i.e. the average difference between the left and right peaks and the gap's score
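
A hedged sketch of the idea: from each gap, walk outward while scores keep rising to find the left and right peaks, then combine the two drops (Hearst sums them; the averaging described above only rescales by a constant).

    def depth_score(scores, i):
        # Walk left, then right, while scores are non-decreasing.
        left = right = scores[i]
        for s in reversed(scores[:i]):
            if s < left:
                break
            left = s
        for s in scores[i + 1:]:
            if s < right:
                break
            right = s
        return (left - scores[i]) + (right - scores[i])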

def _divide_to_tokensequences(self, text): (source)

Divides the text into pseudosentences of fixed size
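
The division itself is plain fixed-size chunking of the word stream; a simplified sketch (the real method also records each token's character offset for later boundary normalization):

    import re

    def pseudosentences(text, w=20):
        # Chunks of w word tokens; the trailing chunk may be shorter.
        words = re.findall(r"\w+", text.lower())
        return [words[i:i + w] for i in range(0, len(words), w)]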

def _identify_boundaries(self, depth_scores): (source)

Identifies boundaries at the peaks of similarity score differences
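
A gap becomes a boundary when its depth score clears a cutoff computed from the score distribution: mean - stdev/2 under the HC policy, mean - stdev under LC. A sketch of the cutoff logic (NLTK additionally keeps accepted boundaries a minimum distance apart):

    import statistics

    def identify_boundaries(depth_scores, policy="HC"):
        avg = statistics.mean(depth_scores)
        std = statistics.stdev(depth_scores)
        cutoff = avg - std / 2.0 if policy == "HC" else avg - std
        # 1 marks a topic boundary at that gap, 0 no boundary.
        return [1 if d > cutoff else 0 for d in depth_scores]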

def _mark_paragraph_breaks(self, text): (source)

Identifies indented text or line breaks as the beginning of paragraphs
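
Roughly, paragraph starts are offset 0 plus every blank-line gap, skipping breaks that would yield very short paragraphs; a sketch of that approach:

    import re

    def mark_paragraph_breaks(text, min_paragraph=100):
        # Character offsets where paragraphs begin.
        blank_line = re.compile(r"[ \t\r\f\v]*\n[ \t\r\f\v]*\n[ \t\r\f\v]*")
        breaks, last = [0], 0
        for m in blank_line.finditer(text):
            if m.start() - last >= min_paragraph:
                breaks.append(m.start())
                last = m.start()
        return breaks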

def _normalize_boundaries(self, text, boundaries, paragraph_breaks): (source)

Normalize the boundaries identified to the original text's paragraph breaks
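
Each raw boundary is snapped to the closest paragraph break so that no segment splits a paragraph; a simplified sketch working directly on character offsets (the real method first converts gap indices to offsets):

    def normalize_boundaries(boundary_offsets, paragraph_breaks):
        # Snap each offset to the nearest paragraph break, dropping duplicates.
        normalized = []
        for off in boundary_offsets:
            best = min(paragraph_breaks, key=lambda pb: abs(pb - off))
            if best not in normalized:
                normalized.append(best)
        return normalized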

def _smooth_scores(self, gap_scores): (source)

Wraps the smooth function from the SciPy Cookbook
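
The Cookbook smooth() is essentially a moving average by convolution with a flat window, with the signal reflected at both ends to avoid edge artifacts; a minimal numpy sketch:

    import numpy as np

    def smooth(scores, window_len=3):
        # Reflect ends, convolve with a normalized flat window, trim back.
        x = np.asarray(scores, dtype=float)
        s = np.r_[x[window_len - 1:0:-1], x, x[-2:-window_len - 1:-1]]
        w = np.ones(window_len)
        y = np.convolve(w / w.sum(), s, mode="same")
        return y[window_len - 1:-window_len + 1]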