class LineTokenizer
Tokenize a string into its lines, optionally discarding blank lines. This is similar to s.split('\n').
>>> from nltk.tokenize import LineTokenizer
>>> s = "Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.\n\nThanks."
>>> LineTokenizer(blanklines='keep').tokenize(s)
['Good muffins cost $3.88', 'in New York. Please buy me', 'two of them.', '', 'Thanks.']
>>> # same as [l for l in s.split('\n') if l.strip()]:
>>> LineTokenizer(blanklines='discard').tokenize(s)
['Good muffins cost $3.88', 'in New York. Please buy me', 'two of them.', 'Thanks.']
Parameters | |
blanklines | Indicates how blank lines should be handled. Valid values are: 'discard' (strip blank lines out of the token list before returning it; a line is considered blank if it contains only whitespace characters), 'keep' (leave all blank lines in the token list), and 'discard-eof' (if the string ends with a newline, do not generate a corresponding empty token after it). |
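The three blanklines modes can be sketched in plain Python. Note that `line_tokenize` below is a hypothetical stand-in written for illustration, not NLTK's actual implementation (which is built on a regexp tokenizer):

```python
def line_tokenize(s, blanklines="discard"):
    """Sketch of LineTokenizer's blanklines modes (hypothetical helper)."""
    lines = s.split("\n")
    if blanklines == "discard":
        # Drop every line that contains only whitespace.
        return [line for line in lines if line.strip()]
    if blanklines == "discard-eof":
        # Keep interior blank lines, but drop the trailing empty token
        # produced by a final newline.
        if lines and not lines[-1].strip():
            lines = lines[:-1]
    return lines

s = "Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.\n\nThanks."
print(line_tokenize(s, "keep"))     # five tokens, including the blank line ''
print(line_tokenize(s, "discard"))  # four tokens, blank line removed
```

With 'keep' this reproduces the first doctest output above, and with 'discard' the second.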
Method | __init__ |
Undocumented |
Method | span_tokenize |
Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token. |
Method | tokenize |
Return a tokenized copy of s. |
Instance Variable | _blanklines |
Undocumented |
Inherited from TokenizerI:
Method | span_tokenize_sents |
Apply self.span_tokenize() to each element of strings, i.e. return [self.span_tokenize(s) for s in strings]. |
Method | tokenize_sents |
Apply self.tokenize() to each element of strings, i.e. return [self.tokenize(s) for s in strings]. |
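These inherited convenience methods simply map the per-string methods over a list of strings. A minimal sketch (a free function mirroring what TokenizerI.tokenize_sents does, with the tokenizer passed in as a callable):

```python
def tokenize_sents(tokenize, strings):
    # Apply a per-string tokenizer to each element of strings.
    return [tokenize(s) for s in strings]

texts = ["one\ntwo", "three"]
# Using a plain line split as the per-string tokenizer:
print(tokenize_sents(lambda s: s.split("\n"), texts))
```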
Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.
Returns | |
iter(tuple(int, int)) | Undocumented |
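The (start_i, end_i) offsets for line tokens can be derived by walking the string and skipping one character for each '\n' separator. A pure-Python sketch (for illustration only; NLTK's actual span_tokenize delegates to a regexp-based helper):

```python
def line_spans(s):
    # Yield (start, end) offsets such that s[start:end] is one line token.
    start = 0
    for line in s.split("\n"):
        end = start + len(line)
        yield (start, end)
        start = end + 1  # skip the '\n' separator

s = "Good muffins cost $3.88\nin New York."
spans = list(line_spans(s))
print(spans)
# Each span slices out exactly one line of the original string:
for start, end in spans:
    print(repr(s[start:end]))
```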
overrides
nltk.tokenize.api.TokenizerI.span_tokenize
Return a tokenized copy of s.
Returns | |
list of str | Undocumented |
overrides
nltk.tokenize.api.TokenizerI.tokenize