class documentation

Interface to the Stanford Segmenter

If the stanford-segmenter version is older than 2016-10-31, then path_to_slf4j should be provided, for example:

seg = StanfordSegmenter(path_to_slf4j='/YOUR_PATH/slf4j-api.jar')
>>> from nltk.tokenize.stanford_segmenter import StanfordSegmenter
>>> seg = StanfordSegmenter()
>>> seg.default_config('zh')
>>> sent = u'这是斯坦福中文分词器测试'
>>> print(seg.segment(sent))
这 是 斯坦福 中文 分词器 测试
<BLANKLINE>
>>> seg.default_config('ar')
>>> sent = u'هذا هو تصنيف ستانفورد العربي للكلمات'
>>> print(seg.segment(sent.split()))
هذا هو تصنيف ستانفورد العربي ل الكلمات
<BLANKLINE>
Method __init__ Undocumented
Method default_config Attempt to initialize the Stanford Word Segmenter for the specified language, using the STANFORD_SEGMENTER and STANFORD_MODELS environment variables
Method segment Undocumented
Method segment_file Segment the text in the given input file
Method segment_sents Segment each sentence in a list of sentences
Method tokenize Return a tokenized copy of s.
Instance Variable java_options Undocumented
Method _execute Undocumented
Constant _JAR Undocumented
Instance Variable _dict Undocumented
Instance Variable _encoding Undocumented
Instance Variable _input_file_path Undocumented
Instance Variable _java_class Undocumented
Instance Variable _keep_whitespaces Undocumented
Instance Variable _model Undocumented
Instance Variable _options_cmd Undocumented
Instance Variable _sihan_corpora_dict Undocumented
Instance Variable _sihan_post_processing Undocumented
Instance Variable _stanford_jar Undocumented

Inherited from TokenizerI:

Method span_tokenize Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.
Method span_tokenize_sents Apply self.span_tokenize() to each element of strings.
Method tokenize_sents Apply self.tokenize() to each element of strings.
def __init__(self, path_to_jar=None, path_to_slf4j=None, java_class=None, path_to_model=None, path_to_dict=None, path_to_sihan_corpora_dict=None, sihan_post_processing='false', keep_whitespaces='false', encoding='UTF-8', options=None, verbose=False, java_options='-mx2g'): (source)

Undocumented
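A minimal constructor sketch (not taken from the source): the keyword arguments come from the signature above, while the jar, model, and dictionary paths are placeholders in the spirit of the '/YOUR_PATH/' convention used earlier and will differ on your machine.

from nltk.tokenize.stanford_segmenter import StanfordSegmenter

seg = StanfordSegmenter(
    path_to_jar='/YOUR_PATH/stanford-segmenter.jar',
    path_to_model='/YOUR_PATH/data/pku.gz',                 # placeholder model file
    path_to_dict='/YOUR_PATH/data/dict-chris6.ser.gz',      # placeholder dictionary file
    path_to_sihan_corpora_dict='/YOUR_PATH/data',
    encoding='UTF-8',
    java_options='-mx2g',
)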

def default_config(self, lang): (source)

Attempt to initialize the Stanford Word Segmenter for the specified language, using the STANFORD_SEGMENTER and STANFORD_MODELS environment variables
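A hedged usage sketch for default_config: per the description it locates the segmenter jar and models via the STANFORD_SEGMENTER and STANFORD_MODELS environment variables, so the paths set below are placeholders for your own installation.

import os
from nltk.tokenize.stanford_segmenter import StanfordSegmenter

# Placeholder paths; point these at your local Stanford Segmenter installation.
os.environ['STANFORD_SEGMENTER'] = '/YOUR_PATH/stanford-segmenter'
os.environ['STANFORD_MODELS'] = '/YOUR_PATH/stanford-segmenter/data'

seg = StanfordSegmenter()
seg.default_config('zh')   # 'zh' for Chinese, 'ar' for Arabic, as in the doctest above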

def segment(self, tokens): (source)

Undocumented

def segment_file(self, input_file_path): (source)
def segment_sents(self, sentences): (source)
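A brief batch-segmentation sketch (an assumption, not from the source): segment_sents is taken here to accept a list of sentence strings and return their segmented form, mirroring how segment is used in the doctest above; the sample sentences are illustrative only.

sents = [u'这是第一句话', u'这是第二句话']
print(seg.segment_sents(sents))

segment_file is assumed to work analogously on the contents of the file named by input_file_path.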
def tokenize(self, s): (source)

Return a tokenized copy of s.

Returns
list of str
Undocumented
java_options = (source)

Undocumented

def _execute(self, cmd, verbose=False): (source)

Undocumented

_JAR: str = (source)

Undocumented

Value
'stanford-segmenter.jar'

_dict = (source)

Undocumented

_encoding = (source)

Undocumented

_input_file_path = (source)

Undocumented

_java_class: str = (source)

Undocumented

_keep_whitespaces = (source)

Undocumented

_model = (source)

Undocumented

_options_cmd = (source)

Undocumented

_sihan_corpora_dict = (source)

Undocumented

_sihan_post_processing: str = (source)

Undocumented

_stanford_jar = (source)

Undocumented