class documentation
class StanfordSegmenter(TokenizerI):
Constructor: StanfordSegmenter(path_to_jar, path_to_slf4j, java_class, path_to_model, ...)
Interface to the Stanford Segmenter
If the stanford-segmenter version is older than 2016-10-31, then path_to_slf4j should be provided, for example:
seg = StanfordSegmenter(path_to_slf4j='/YOUR_PATH/slf4j-api.jar')
>>> from nltk.tokenize.stanford_segmenter import StanfordSegmenter
>>> seg = StanfordSegmenter()
>>> seg.default_config('zh')
>>> sent = u'这是斯坦福中文分词器测试'
>>> print(seg.segment(sent))
这 是 斯坦福 中文 分词器 测试
<BLANKLINE>
>>> seg.default_config('ar')
>>> sent = u'هذا هو تصنيف ستانفورد العربي للكلمات'
>>> print(seg.segment(sent.split()))
هذا هو تصنيف ستانفورد العربي ل الكلمات
<BLANKLINE>
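The version rule for path_to_slf4j can be expressed as a simple date comparison. The sketch below is illustrative only: `needs_slf4j` and `constructor_kwargs` are hypothetical helpers, not part of NLTK, and the code assumes the segmenter release is identified by an ISO date string such as `2015-12-09`.

```python
from datetime import date

# Cutoff stated in the class documentation: releases older than this
# date should be given an explicit slf4j jar.
SLF4J_CUTOFF = date(2016, 10, 31)


def needs_slf4j(release: str) -> bool:
    """Return True if a release dated `release` (YYYY-MM-DD) predates the cutoff.

    Hypothetical helper for illustration; NLTK does not provide it.
    """
    return date.fromisoformat(release) < SLF4J_CUTOFF


def constructor_kwargs(release: str, slf4j_jar: str) -> dict:
    """Build keyword arguments for StanfordSegmenter under the rule above."""
    kwargs = {}
    if needs_slf4j(release):
        # path_to_slf4j is a documented constructor parameter.
        kwargs["path_to_slf4j"] = slf4j_jar
    return kwargs
```

The resulting dictionary could then be unpacked into the constructor, for example `StanfordSegmenter(**constructor_kwargs('2015-12-09', '/YOUR_PATH/slf4j-api.jar'))`.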
Method | __init__ |
Undocumented |
Method | default_config |
Attempt to initialize Stanford Word Segmenter for the specified language using the STANFORD_SEGMENTER and STANFORD_MODELS environment variables |
Method | segment |
Undocumented |
Method | segment |
No summary |
Method | segment |
No summary |
Method | tokenize |
Return a tokenized copy of s. |
Instance Variable | java |
Undocumented |
Method | _execute |
Undocumented |
Constant | _JAR |
Undocumented |
Instance Variable | _dict |
Undocumented |
Instance Variable | _encoding |
Undocumented |
Instance Variable | _input |
Undocumented |
Instance Variable | _java |
Undocumented |
Instance Variable | _keep |
Undocumented |
Instance Variable | _model |
Undocumented |
Instance Variable | _options |
Undocumented |
Instance Variable | _sihan |
Undocumented |
Instance Variable | _sihan |
Undocumented |
Instance Variable | _stanford |
Undocumented |
Inherited from TokenizerI:
Method | span |
Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token. |
Method | span |
Apply self.span_tokenize() to each element of strings. I.e.: |
Method | tokenize |
Apply self.tokenize() to each element of strings. I.e.: |
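The offset convention described for the inherited span method, where s[start_i:end_i] is the corresponding token, can be illustrated by aligning a token list back onto the original string. This is a minimal sketch under stated assumptions: `align_spans` and `align_spans_each` are hypothetical helpers, not NLTK functions, and this page does not say whether StanfordSegmenter itself implements span tokenization.

```python
from typing import Iterable, Iterator, List, Tuple


def align_spans(s: str, tokens: Iterable[str]) -> Iterator[Tuple[int, int]]:
    """Yield (start_i, end_i) offsets such that s[start_i:end_i] is each token.

    Tokens are assumed to occur in s in order; str.index raises
    ValueError if a token cannot be found after the previous one.
    """
    pos = 0
    for tok in tokens:
        start = s.index(tok, pos)
        end = start + len(tok)
        yield start, end
        pos = end


def align_spans_each(pairs: Iterable[Tuple[str, List[str]]]) -> List[List[Tuple[int, int]]]:
    """Apply align_spans to each (string, tokens) pair, one result list per string."""
    return [list(align_spans(s, toks)) for s, toks in pairs]


s = "这是斯坦福中文分词器测试"
tokens = ["这", "是", "斯坦福", "中文", "分词器", "测试"]
spans = list(align_spans(s, tokens))

# Slicing the original string with each span recovers the tokens.
recovered = [s[start:end] for start, end in spans]
```

The per-element variants in the table follow the same pattern as `align_spans_each`: the single-string method is applied to every element of the input and the results are collected in order.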
def __init__(self, path_to_jar=None, path_to_slf4j=None, java_class=None, path_to_model=None, path_to_dict=None, path_to_sihan_corpora_dict=None, sihan_post_processing='false', keep_whitespaces='false', encoding='UTF-8', options=None, verbose=False, java_options='-mx2g'):
Undocumented
Attempt to initialize Stanford Word Segmenter for the specified language using the STANFORD_SEGMENTER and STANFORD_MODELS environment variables
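Because this configuration step relies on the STANFORD_SEGMENTER and STANFORD_MODELS environment variables, it can help to check that they are set before calling it. The sketch below is a pre-flight check only: `missing_segmenter_env` is a hypothetical helper, and it assumes both variables must be non-empty, which this page does not state explicitly. How NLTK resolves the variables internally is not shown here.

```python
import os
from typing import List, Mapping

# Variable names taken from the method description above.
REQUIRED_VARS = ("STANFORD_SEGMENTER", "STANFORD_MODELS")


def missing_segmenter_env(environ: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names of required variables that are unset or empty.

    Hypothetical helper for illustration; not part of NLTK.
    """
    return [name for name in REQUIRED_VARS if not environ.get(name)]


missing = missing_segmenter_env()
if missing:
    print("Set these variables before configuring the segmenter:", ", ".join(missing))
```

Passing a plain dictionary instead of `os.environ` makes the check easy to exercise without touching the real environment.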
overrides
nltk.tokenize.api.TokenizerI.tokenize
Return a tokenized copy of s.
Returns | |
list of str | Undocumented |
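The class example shows segment() producing a single space-separated string with a trailing newline, while tokenize is documented as returning a list of str. One way to bridge the two is to split the segmented string on whitespace. This is a sketch, not a description of how the class implements tokenize: `split_segmented` is a hypothetical helper, and the raw string below is copied from the example output rather than produced by running the segmenter.

```python
from typing import List


def split_segmented(output: str) -> List[str]:
    """Split space-separated segmenter output into a list of tokens.

    str.split() with no arguments also discards the trailing newline.
    """
    return output.split()


# Output text as shown in the class example, including the trailing newline.
raw = "这 是 斯坦福 中文 分词器 测试\n"
tokens = split_segmented(raw)
```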