class StanfordTokenizer(TokenizerI)

Interface to the Stanford Tokenizer.

>>> from nltk.tokenize.stanford import StanfordTokenizer
>>> s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\nThanks."
>>> StanfordTokenizer().tokenize(s)
['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
>>> s = "The colour of the wall is blue."
>>> StanfordTokenizer(options={"americanize": True}).tokenize(s)
['The', 'color', 'of', 'the', 'wall', 'is', 'blue', '.']
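If stanford-postagger.jar cannot be found automatically, its location can be supplied via the path_to_jar argument or an environment variable. A minimal sketch, assuming the STANFORD_POSTAGGER environment variable is consulted during the jar search and using a hypothetical install path:

```python
import os

# Hypothetical install location; point this at the real jar on your system.
os.environ["STANFORD_POSTAGGER"] = "/opt/stanford/stanford-postagger.jar"

# With the variable set, StanfordTokenizer() can be constructed without
# an explicit path_to_jar argument.
```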
Method __init__ Create the tokenizer wrapper: locate stanford-postagger.jar and record the encoding, tokenizer options, and JVM options to use.
Method tokenize Use the Stanford tokenizer's PTBTokenizer to tokenize multiple sentences.
Instance Variable java_options Options passed to the JVM (e.g. '-mx1000m' to cap heap size).
Static Method _parse_tokenized_output Split the tokenizer's output (one token per line) into a list of tokens.
Method _execute Run the given Java command on the input text and return its decoded output.
Constant _JAR Filename of the jar searched for when path_to_jar is not given.
Instance Variable _encoding Character encoding used to communicate with the tokenizer.
Instance Variable _options_cmd Tokenizer options flattened into a comma-separated key=value string for the command line.
Instance Variable _stanford_jar Resolved path to stanford-postagger.jar.

Inherited from TokenizerI:

Method span_tokenize Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.
Method span_tokenize_sents Apply self.span_tokenize() to each element of strings. I.e.: return [self.span_tokenize(s) for s in strings]
Method tokenize_sents Apply self.tokenize() to each element of strings. I.e.: return [self.tokenize(s) for s in strings]
def __init__(self, path_to_jar=None, encoding='utf8', options=None, verbose=False, java_options='-mx1000m'): (source)

Create the tokenizer wrapper. path_to_jar gives the location of stanford-postagger.jar (searched for automatically if None); encoding is the character encoding used to communicate with the tokenizer; options is a dict of PTBTokenizer options (e.g. {"americanize": True}); verbose enables diagnostic output; java_options are flags passed to the JVM.

def tokenize(self, s): (source)

Use the Stanford tokenizer's PTBTokenizer to tokenize multiple sentences.

java_options = (source)

Options passed to the JVM (e.g. '-mx1000m' to cap heap size).

@staticmethod
def _parse_tokenized_output(s): (source)

Split the tokenizer's output (one token per line) into a list of tokens.
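PTBTokenizer emits one token per line of output, so parsing reduces to splitting on newlines. A standalone sketch of that behavior (the function name here is illustrative, not the private helper itself):

```python
def parse_tokenized_output(s):
    # One token per output line; splitting on newlines recovers the list.
    return s.splitlines()

tokens = parse_tokenized_output("Good\nmuffins\ncost\n$\n3.88\n")
```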

def _execute(self, cmd, input_, verbose=False): (source)

Run the given Java command cmd on input_ and return the tokenizer's decoded standard output; verbose enables diagnostic messages.
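The options dict passed to __init__ is flattened into a single -options argument appended to the command. An illustrative sketch, assuming a comma-separated key=value format (the helper name and the exact serialization of values are assumptions, not the library's code):

```python
def options_to_cmd(options):
    # Hypothetical helper: flatten {"americanize": True} into
    # ["-options", "americanize=true"] (assumed wire format).
    joined = ",".join(f"{k}={v}".lower() for k, v in (options or {}).items())
    return ["-options", joined] if joined else []

print(options_to_cmd({"americanize": True}))
```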

_JAR: str = (source)

Filename of the jar searched for when path_to_jar is not given.

Value
'stanford-postagger.jar'
_encoding = (source)

Character encoding used to communicate with the tokenizer.

_options_cmd = (source)

Tokenizer options flattened into a comma-separated key=value string for the command line.

_stanford_jar = (source)

Resolved path to stanford-postagger.jar.