class documentation

Interface to the Stanford Segmenter

If the stanford-segmenter version is older than 2016-10-31, then path_to_slf4j should be provided, for example:

seg = StanfordSegmenter(path_to_slf4j='/YOUR_PATH/slf4j-api.jar')
>>> from nltk.tokenize.stanford_segmenter import StanfordSegmenter
>>> seg = StanfordSegmenter()
>>> seg.default_config('zh')
>>> sent = u'这是斯坦福中文分词器测试'
>>> print(seg.segment(sent))
这 是 斯坦福 中文 分词器 测试
<BLANKLINE>
>>> seg.default_config('ar')
>>> sent = u'هذا هو تصنيف ستانفورد العربي للكلمات'
>>> print(seg.segment(sent.split()))
هذا هو تصنيف ستانفورد العربي ل الكلمات
<BLANKLINE>
Method __init__ Undocumented
Method default_config Attempt to initialize the Stanford Word Segmenter for the specified language, using the STANFORD_SEGMENTER and STANFORD_MODELS environment variables
Method segment Undocumented
Method segment_file Segment the text in the given input file
Method segment_sents Segment each sentence in a list of sentences
Method tokenize Return a tokenized copy of s.
Instance Variable java_options Undocumented
Method _execute Undocumented
Constant _JAR Undocumented
Instance Variable _dict Undocumented
Instance Variable _encoding Undocumented
Instance Variable _input_file_path Undocumented
Instance Variable _java_class Undocumented
Instance Variable _keep_whitespaces Undocumented
Instance Variable _model Undocumented
Instance Variable _options_cmd Undocumented
Instance Variable _sihan_corpora_dict Undocumented
Instance Variable _sihan_post_processing Undocumented
Instance Variable _stanford_jar Undocumented

Inherited from TokenizerI:

Method span_tokenize Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.
Method span_tokenize_sents Apply self.span_tokenize() to each element of strings.
Method tokenize_sents Apply self.tokenize() to each element of strings.
def __init__(self, path_to_jar=None, path_to_slf4j=None, java_class=None, path_to_model=None, path_to_dict=None, path_to_sihan_corpora_dict=None, sihan_post_processing='false', keep_whitespaces='false', encoding='UTF-8', options=None, verbose=False, java_options='-mx2g'): (source)

Undocumented
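A minimal constructor sketch (not taken from the source): the keyword arguments come from the signature above, while the jar, model, and dictionary paths are placeholders in the spirit of the '/YOUR_PATH/' convention used earlier and will differ on your machine.

from nltk.tokenize.stanford_segmenter import StanfordSegmenter

seg = StanfordSegmenter(
    path_to_jar='/YOUR_PATH/stanford-segmenter.jar',
    path_to_model='/YOUR_PATH/data/pku.gz',                 # placeholder model file
    path_to_dict='/YOUR_PATH/data/dict-chris6.ser.gz',      # placeholder dictionary file
    path_to_sihan_corpora_dict='/YOUR_PATH/data',
    encoding='UTF-8',
    java_options='-mx2g',
)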

def default_config(self, lang): (source)

Attempt to initialize the Stanford Word Segmenter for the specified language, using the STANFORD_SEGMENTER and STANFORD_MODELS environment variables
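A hedged usage sketch for default_config: per the description it locates the segmenter jar and models via the STANFORD_SEGMENTER and STANFORD_MODELS environment variables, so the paths set below are placeholders for your own installation.

import os
from nltk.tokenize.stanford_segmenter import StanfordSegmenter

# Placeholder paths; point these at your local Stanford Segmenter installation.
os.environ['STANFORD_SEGMENTER'] = '/YOUR_PATH/stanford-segmenter'
os.environ['STANFORD_MODELS'] = '/YOUR_PATH/stanford-segmenter/data'

seg = StanfordSegmenter()
seg.default_config('zh')   # 'zh' for Chinese, 'ar' for Arabic, as in the doctest above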

def segment(self, tokens): (source)

Undocumented

def segment_file(self, input_file_path): (source)
def segment_sents(self, sentences): (source)
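A brief batch-segmentation sketch (an assumption, not from the source): segment_sents is taken here to accept a list of sentence strings and return their segmented form, mirroring how segment is used in the doctest above; the sample sentences are illustrative only.

sents = [u'这是第一句话', u'这是第二句话']
print(seg.segment_sents(sents))

segment_file is assumed to work analogously on the contents of the file named by input_file_path.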
def tokenize(self, s): (source)

Return a tokenized copy of s.

Returns
list of str
Undocumented
java_options = (source)

Undocumented

def _execute(self, cmd, verbose=False): (source)

Undocumented

_JAR: str = (source)

Undocumented

Value
'stanford-segmenter.jar'

_dict = (source)

Undocumented

_encoding = (source)

Undocumented

_input_file_path = (source)

Undocumented

_java_class: str = (source)

Undocumented

_keep_whitespaces = (source)

Undocumented

_model = (source)

Undocumented

_options_cmd = (source)

Undocumented

_sihan_corpora_dict = (source)

Undocumented

_sihan_post_processing: str = (source)

Undocumented

_stanford_jar = (source)

Undocumented