class TextTilingTokenizer(TokenizerI):
Constructor: TextTilingTokenizer(w, k, similarity_method, stopwords, ...)
Tokenize a document into topical sections using the TextTiling algorithm. This algorithm detects subtopic shifts based on the analysis of lexical co-occurrence patterns.
The process starts by tokenizing the text into pseudosentences of a fixed size w. Then, depending on the method used, similarity scores are assigned at sentence gaps. The algorithm proceeds by detecting the peak differences between these scores and marking them as boundaries. The boundaries are normalized to the closest paragraph break and the segmented text is returned.
>>> from nltk.corpus import brown
>>> tt = TextTilingTokenizer(demo_mode=True)
>>> text = brown.raw()[:4000]
>>> s, ss, d, b = tt.tokenize(text)
>>> b
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0]
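With demo_mode off (the default), tokenize() returns the segmented text itself rather than the intermediate scores. A minimal usage sketch (how many segments you get depends on the input text):

>>> tt = TextTilingTokenizer()
>>> segments = tt.tokenize(brown.raw()[:4000])
>>> # each element of segments is one topical section of the text,
>>> # ending at a paragraph break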
Parameters | |
w | Pseudosentence size |
k | Size (in sentences) of the block used in the block comparison method |
similarity_method | The method used for determining similarity scores: BLOCK_COMPARISON (default) or VOCABULARY_INTRODUCTION |
stopwords | A list of stopwords that are filtered out (defaults to NLTK's stopwords corpus) |
smoothing_method | The method used for smoothing the score plot: DEFAULT_SMOOTHING (default) |
smoothing_width | The width of the window used by the smoothing method |
smoothing_rounds | The number of smoothing passes |
cutoff_policy | The policy used to determine the number of boundaries: HC (default) or LC |
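All of these are keyword arguments to the constructor. As an illustration (the values shown for w and k are the defaults, not tuning advice):

>>> from nltk.tokenize.texttiling import (
...     TextTilingTokenizer, BLOCK_COMPARISON, LC)
>>> tt = TextTilingTokenizer(w=20, k=10,
...                          similarity_method=BLOCK_COMPARISON,
...                          cutoff_policy=LC)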
Method | __init__ |
Undocumented |
Method | tokenize |
Return a tokenized copy of text, where each "token" represents a separate topic. |
Method | _block_comparison |
Implements the block comparison method |
Method | _create_token_table |
Creates a table of TokenTableFields |
Method | _depth_scores |
Calculates the depth of each gap, i.e. the average difference between the left and right peaks and the gap's score (see the sketch after this table) |
Method | _divide_to_tokensequences |
Divides the text into pseudosentences of fixed size |
Method | _identify_boundaries |
Identifies boundaries at the peaks of similarity score differences |
Method | _mark_paragraph_breaks |
Identifies indented text or line breaks as the beginning of paragraphs |
Method | _normalize_boundaries |
Normalizes the identified boundaries to the original text's paragraph breaks |
Method | _smooth_scores |
Wraps the smooth function from the SciPy Cookbook |
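The sketch below is a simplified illustration of what _depth_scores and _identify_boundaries compute, not NLTK's exact implementation (which also clips the ends of the score sequence and smooths the scores first); the function names mirror the methods only for readability.

from statistics import mean, stdev

def depth_scores(gap_scores):
    # For each gap, climb to the nearest peak on each side; the depth
    # is the total drop from the two peaks down to the gap's score.
    depths = []
    for i, score in enumerate(gap_scores):
        lpeak = score
        for s in reversed(gap_scores[:i]):   # walk left while rising
            if s < lpeak:
                break
            lpeak = s
        rpeak = score
        for s in gap_scores[i + 1:]:         # walk right while rising
            if s < rpeak:
                break
            rpeak = s
        depths.append((lpeak - score) + (rpeak - score))
    return depths

def identify_boundaries(depths, policy="HC"):
    # A gap becomes a boundary when its depth clears a cutoff derived
    # from the mean and standard deviation of all depth scores.
    avg, sd = mean(depths), stdev(depths)
    cutoff = avg - sd if policy == "HC" else avg - sd / 2
    return [1 if d > cutoff else 0 for d in depths]

Boundaries found this way are then snapped to the closest paragraph break by _normalize_boundaries before the segments are returned.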
Inherited from TokenizerI:
Method | span_tokenize |
Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token. |
Method | span_tokenize_sents |
Apply self.span_tokenize() to each element of strings. I.e.: return [self.span_tokenize(s) for s in strings] |
Method | tokenize_sents |
Apply self.tokenize() to each element of strings. I.e.: return [self.tokenize(s) for s in strings] |