class LineTokenizer(TokenizerI)

Tokenize a string into its lines, optionally discarding blank lines. This is similar to s.split('\n').

>>> from nltk.tokenize import LineTokenizer
>>> s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."
>>> LineTokenizer(blanklines='keep').tokenize(s)
['Good muffins cost $3.88', 'in New York.  Please buy me',
'two of them.', '', 'Thanks.']
>>> # same as [l for l in s.split('\n') if l.strip()]
>>> LineTokenizer(blanklines='discard').tokenize(s)
['Good muffins cost $3.88', 'in New York.  Please buy me',
'two of them.', 'Thanks.']
Parameters

blanklines
    Indicates how blank lines should be handled. Valid values are:

      • discard: strip blank lines out of the token list before returning it.
        A line is considered blank if it contains only whitespace characters.
      • keep: leave all blank lines in the token list.
      • discard-eof: if the string ends with a newline, then do not generate
        a corresponding token '' after that newline (see the example below).
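
For illustration, a hedged example of the 'discard-eof' mode (the expected
output below follows from the description above: internal blank lines are
kept, while the trailing newline yields no '' token):

>>> s2 = "Good muffins cost $3.88\n\nThanks.\n"
>>> LineTokenizer(blanklines='discard-eof').tokenize(s2)
['Good muffins cost $3.88', '', 'Thanks.']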
Method __init__ Construct a LineTokenizer with the given blank-line handling mode.
Method span_tokenize Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.
Method tokenize Return a tokenized copy of s.
Instance Variable _blanklines The blank-line handling mode ('discard', 'keep', or 'discard-eof') passed to __init__.

Inherited from TokenizerI:

Method span_tokenize_sents Apply self.span_tokenize() to each element of strings, i.e. return [self.span_tokenize(s) for s in strings].
Method tokenize_sents Apply self.tokenize() to each element of strings, i.e. return [self.tokenize(s) for s in strings].
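
A minimal usage sketch of the inherited batch method (expected output
assuming the default blanklines='discard'):

>>> LineTokenizer().tokenize_sents(["one\ntwo", "three"])
[['one', 'two'], ['three']]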
def __init__(self, blanklines='discard'):

Construct a LineTokenizer with the given blank-line handling mode; see the blanklines parameter above for the accepted values.

def span_tokenize(self, s):

Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.

Returns
iter(tuple(int, int)): an iterator over the (start_i, end_i) offsets of the tokens.
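
A short illustrative example of how the offsets map back to the tokens (the
exact offsets shown are assumed from the description above, using a string
with no blank lines so all blanklines modes agree):

>>> s = "One line\nAnother line"
>>> list(LineTokenizer().span_tokenize(s))
[(0, 8), (9, 21)]
>>> [s[start:end] for (start, end) in LineTokenizer().span_tokenize(s)]
['One line', 'Another line']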
def tokenize(self, s):

Return a tokenized copy of s.

Returns
list of str: the lines of s, filtered according to the blanklines mode.
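
To make the three modes concrete, the following is a hypothetical pure-Python
equivalent of tokenize. It is a sketch consistent with the documented behavior,
not NLTK's actual source, and the name line_tokenize is invented for
illustration:

def line_tokenize(s, blanklines='discard'):
    # Hypothetical sketch, not NLTK's source code.
    lines = s.splitlines()  # splitlines() yields no trailing '' for a final newline
    if blanklines == 'discard':
        # Drop every line that contains only whitespace.
        lines = [line for line in lines if line.strip()]
    elif blanklines == 'discard-eof':
        # Drop only a blank final line; keep internal blank lines.
        if lines and not lines[-1].strip():
            lines.pop()
    return lines

>>> line_tokenize("a\n\nb\n", blanklines='keep')
['a', '', 'b']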
_blanklines =

The blank-line handling mode ('discard', 'keep', or 'discard-eof') passed to __init__.