class documentation
class StanfordTokenizer(TokenizerI): (source)
Constructor: StanfordTokenizer(path_to_jar, encoding, options, verbose, java_options)
Interface to the Stanford Tokenizer
>>> from nltk.tokenize.stanford import StanfordTokenizer
>>> s = "Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.\nThanks."
>>> StanfordTokenizer().tokenize(s)
['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
>>> s = "The colour of the wall is blue."
>>> StanfordTokenizer(options={"americanize": True}).tokenize(s)
['The', 'color', 'of', 'the', 'wall', 'is', 'blue', '.']
Method | __init__ |
Undocumented |
Method | tokenize |
Use the Stanford Tokenizer's PTBTokenizer to tokenize multiple sentences. |
Instance Variable | java |
Undocumented |
Static Method | _parse |
Undocumented |
Method | _execute |
Undocumented |
Constant | _JAR |
Undocumented |
Instance Variable | _encoding |
Undocumented |
Instance Variable | _options |
Undocumented |
Instance Variable | _stanford |
Undocumented |
Inherited from TokenizerI:
Method | span_tokenize |
Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token. |
Method | span_tokenize_sents |
Apply self.span_tokenize() to each element of strings. I.e.: |
Method | tokenize_sents |
Apply self.tokenize() to each element of strings. I.e.: |
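The span_tokenize contract inherited from TokenizerI can be illustrated with a minimal, self-contained sketch. Note this is a hypothetical whitespace-based tokenizer for illustration only, not NLTK's Stanford wrapper, which applies PTB tokenization rules via Java:

```python
import re

def span_tokenize(s):
    # Yield (start, end) offsets such that s[start:end] is the
    # corresponding token -- the TokenizerI contract. This sketch
    # treats any run of non-whitespace as a token.
    for m in re.finditer(r"\S+", s):
        yield m.start(), m.end()

s = "Good muffins cost $3.88"
spans = list(span_tokenize(s))
# Each span recovers its token from the original string.
tokens = [s[start:end] for start, end in spans]
```

Slicing the original string with the returned offsets reproduces the token list exactly, which is what distinguishes span_tokenize from plain tokenize.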
def __init__(self, path_to_jar=None, encoding='utf8', options=None, verbose=False, java_options='-mx1000m'):
(source)
Undocumented
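The options parameter is a dict of PTBTokenizer options (e.g. {"americanize": True}, as in the doctest above). A plausible sketch of how such a dict could be flattened into the comma-separated "key=value" string that PTBTokenizer's -options flag expects; the name parse_options and the exact conversion are assumptions, since the real _parse helper is undocumented:

```python
def parse_options(options):
    # Hypothetical sketch: flatten {"americanize": True} into
    # "americanize=true". Booleans are lower-cased to match Java
    # property conventions; the actual _parse helper may differ.
    return ",".join(f"{k}={str(v).lower()}" for k, v in (options or {}).items())
```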
overrides
nltk.tokenize.api.TokenizerI.tokenize
Use the Stanford Tokenizer's PTBTokenizer to tokenize multiple sentences.
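The inherited batch method tokenize_sents simply maps tokenize over a list of strings. A hedged sketch of that pattern, using a hypothetical whitespace-splitting stand-in for StanfordTokenizer, since running the real class requires the Stanford jar and a Java runtime:

```python
class SimpleTokenizer:
    """Stand-in for StanfordTokenizer: splitting on whitespace only
    approximates PTBTokenizer, which also separates punctuation."""

    def tokenize(self, s):
        return s.split()

    def tokenize_sents(self, strings):
        # TokenizerI's default behaviour: apply self.tokenize()
        # to each element of strings.
        return [self.tokenize(s) for s in strings]

tok = SimpleTokenizer()
result = tok.tokenize_sents(["Please buy me", "two of them."])
```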