class documentation
class StanfordTokenizer(TokenizerI): (source)
Constructor: StanfordTokenizer(path_to_jar, encoding, options, verbose, java_options)
Interface to the Stanford Tokenizer
>>> from nltk.tokenize.stanford import StanfordTokenizer
>>> s = "Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.\nThanks."
>>> StanfordTokenizer().tokenize(s)
['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
>>> s = "The colour of the wall is blue."
>>> StanfordTokenizer(options={"americanize": True}).tokenize(s)
['The', 'color', 'of', 'the', 'wall', 'is', 'blue', '.']
Method | __init__ |
Undocumented |
Method | tokenize |
Use the Stanford Tokenizer's PTBTokenizer to tokenize multiple sentences. |
Instance Variable | java |
Undocumented |
Static Method | _parse |
Undocumented |
Method | _execute |
Undocumented |
Constant | _JAR |
Undocumented |
Instance Variable | _encoding |
Undocumented |
Instance Variable | _options |
Undocumented |
Instance Variable | _stanford |
Undocumented |
Inherited from TokenizerI:
Method | span_tokenize |
Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token. |
Method | span_tokenize_sents |
Apply self.span_tokenize() to each element of strings. I.e.: |
Method | tokenize_sents |
Apply self.tokenize() to each element of strings. I.e.: |
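The span_tokenize contract inherited from TokenizerI can be illustrated with a minimal, self-contained sketch. Note this is a hypothetical whitespace-based tokenizer for illustration only, not NLTK's Stanford wrapper, which applies PTB tokenization rules via Java:

```python
import re

def span_tokenize(s):
    # Yield (start, end) offsets such that s[start:end] is the
    # corresponding token -- the TokenizerI contract. This sketch
    # treats any run of non-whitespace as a token.
    for m in re.finditer(r"\S+", s):
        yield m.start(), m.end()

s = "Good muffins cost $3.88"
spans = list(span_tokenize(s))
# Each span recovers its token from the original string.
tokens = [s[start:end] for start, end in spans]
```

Slicing the original string with the returned offsets reproduces the token list exactly, which is what distinguishes span_tokenize from plain tokenize.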
def __init__(self, path_to_jar=None, encoding='utf8', options=None, verbose=False, java_options='-mx1000m'):
(source)
Undocumented
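The options parameter is a dict of PTBTokenizer options (e.g. {"americanize": True}, as in the doctest above). A plausible sketch of how such a dict could be flattened into the comma-separated "key=value" string that PTBTokenizer's -options flag expects; the name parse_options and the exact conversion are assumptions, since the real _parse helper is undocumented:

```python
def parse_options(options):
    # Hypothetical sketch: flatten {"americanize": True} into
    # "americanize=true". Booleans are lower-cased to match Java
    # property conventions; the actual _parse helper may differ.
    return ",".join(f"{k}={str(v).lower()}" for k, v in (options or {}).items())
```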
overrides
nltk.tokenize.api.TokenizerI.tokenize
Use the Stanford Tokenizer's PTBTokenizer to tokenize multiple sentences.
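The inherited batch method tokenize_sents simply maps tokenize over a list of strings. A hedged sketch of that pattern, using a hypothetical whitespace-splitting stand-in for StanfordTokenizer, since running the real class requires the Stanford jar and a Java runtime:

```python
class SimpleTokenizer:
    """Stand-in for StanfordTokenizer: splitting on whitespace only
    approximates PTBTokenizer, which also separates punctuation."""

    def tokenize(self, s):
        return s.split()

    def tokenize_sents(self, strings):
        # TokenizerI's default behaviour: apply self.tokenize()
        # to each element of strings.
        return [self.tokenize(s) for s in strings]

tok = SimpleTokenizer()
result = tok.tokenize_sents(["Please buy me", "two of them."])
```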