class documentation

class NISTTokenizer(TokenizerI):

This NIST tokenizer is sentence-based, unlike the original paragraph-based tokenization from mteval-14.pl; the sentence-based tokenization is consistent with the other tokenizers available in NLTK.

>>> from nltk.tokenize.nist import NISTTokenizer
>>> nist = NISTTokenizer()
>>> s = "Good muffins cost $3.88 in New York."
>>> expected_lower = [u'good', u'muffins', u'cost', u'$', u'3.88', u'in', u'new', u'york', u'.']
>>> expected_cased = [u'Good', u'muffins', u'cost', u'$', u'3.88', u'in', u'New', u'York', u'.']
>>> nist.tokenize(s, lowercase=False) == expected_cased
True
>>> nist.tokenize(s, lowercase=True) == expected_lower  # Lowercased.
True
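The tokenize() signature below also accepts a return_str flag. Assuming it returns the space-joined token string instead of a token list (an inference from the signature; this output is not shown on the original page), usage would look like:

>>> nist.tokenize(s, return_str=True)
'Good muffins cost $ 3.88 in New York .'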

international_tokenize() is the preferred method for tokenizing non-European text, e.g.:

>>> from nltk.tokenize.nist import NISTTokenizer
>>> nist = NISTTokenizer()

# Input strings.
>>> albb = u'Alibaba Group Holding Limited (Chinese: 阿里巴巴集团控股 有限公司) is a Chinese e-commerce company...'
>>> amz = u'Amazon.com, Inc. (/ˈæməzɒn/) is an American electronic commerce...'
>>> rkt = u'Rakuten, Inc. (楽天株式会社 Rakuten Kabushiki-gaisha) is a Japanese electronic commerce and Internet company based in Tokyo.'

# Expected tokens.
>>> expected_albb = [u'Alibaba', u'Group', u'Holding', u'Limited', u'(', u'Chinese', u':', u'阿里巴巴集团控股', u'有限公司', u')']
>>> expected_amz = [u'Amazon', u'.', u'com', u',', u'Inc', u'.', u'(', u'/', u'ˈæ', u'm']
>>> expected_rkt = [u'Rakuten', u',', u'Inc', u'.', u'(', u'楽天株式会社', u'Rakuten', u'Kabushiki', u'-', u'gaisha']

>>> nist.international_tokenize(albb)[:10] == expected_albb
True
>>> nist.international_tokenize(amz)[:10] == expected_amz
True
>>> nist.international_tokenize(rkt)[:10] == expected_rkt
True

# Doctest for patching issue #1926
>>> sent = u'this is a foo☄sentence.'
>>> expected_sent = [u'this', u'is', u'a', u'foo', u'☄', u'sentence', u'.']
>>> nist.international_tokenize(sent) == expected_sent
True

Method international_tokenize Undocumented
Method lang_independent_sub Performs the language-independent string substitutions.
Method tokenize Return a tokenized copy of text.
Constant DASH_PRECEED_DIGIT Undocumented
Constant INTERNATIONAL_REGEXES Undocumented
Constant LANG_DEPENDENT_REGEXES Undocumented
Constant NONASCII Undocumented
Constant PERIOD_COMMA_FOLLOW Undocumented
Constant PERIOD_COMMA_PRECEED Undocumented
Constant PUNCT Undocumented
Constant PUNCT_1 Undocumented
Constant PUNCT_2 Undocumented
Constant STRIP_EOL_HYPHEN Undocumented
Constant STRIP_SKIP Undocumented
Constant SYMBOLS Undocumented
Class Variable number_regex Undocumented
Class Variable punct_regex Undocumented
Class Variable pup_number Undocumented
Class Variable pup_punct Undocumented
Class Variable pup_symbol Undocumented
Class Variable symbol_regex Undocumented

Inherited from TokenizerI:

Method span_tokenize Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.
Method span_tokenize_sents Apply self.span_tokenize() to each element of strings. I.e.: return [self.span_tokenize(s) for s in strings]
Method tokenize_sents Apply self.tokenize() to each element of strings. I.e.: return [self.tokenize(s) for s in strings]
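For instance, a minimal check of the inherited batch helper, assuming tokenize_sents() simply maps tokenize() over its input as described above:

>>> sents = ['Good muffins cost $3.88.', 'Please buy me two of them.']
>>> nist.tokenize_sents(sents) == [nist.tokenize(s) for s in sents]
True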
def international_tokenize(self, text, lowercase=False, split_non_ascii=True, return_str=False):

Undocumented

def lang_independent_sub(self, text):

Performs the language-independent string substitutions.
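The (pattern, replacement) constants documented below hint at what these substitutions are. A minimal sketch, assuming the method applies STRIP_SKIP and STRIP_EOL_HYPHEN in sequence (the exact set and order of substitutions is an assumption):

import re

# Stand-ins mirroring the documented constants: each is a
# (compiled_pattern, replacement) pair.
STRIP_SKIP = re.compile(r'<skipped>'), ''     # drop <skipped> markers
STRIP_EOL_HYPHEN = re.compile('\u2028'), ' '  # line separator -> space

def lang_independent_sub(text):
    # Apply each (regexp, substitution) pair in turn (assumed order).
    for regexp, substitution in (STRIP_SKIP, STRIP_EOL_HYPHEN):
        text = regexp.sub(substitution, text)
    return text

print(lang_independent_sub('foo<skipped>bar\u2028baz'))  # foobar baz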

def tokenize(self, text, lowercase=False, western_lang=True, return_str=False):

Return a tokenized copy of text.

Returns
list of str
DASH_PRECEED_DIGIT

Undocumented

Value
(re.compile(r'([0-9])(-)'), '\\1 \\2 ')
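An illustrative use of this (pattern, replacement) pair, showing how a dash that follows a digit gets padded with spaces (example added for clarity; not from the original page):

>>> import re
>>> regexp, substitution = re.compile(r'([0-9])(-)'), '\\1 \\2 '
>>> regexp.sub(substitution, '5-year')
'5 - year'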
INTERNATIONAL_REGEXES

Undocumented

Value
[NONASCII, PUNCT_1, PUNCT_2, SYMBOLS]

LANG_DEPENDENT_REGEXES

Undocumented

Value
[PUNCT, PERIOD_COMMA_PRECEED, PERIOD_COMMA_FOLLOW, DASH_PRECEED_DIGIT]
NONASCII

Undocumented

Value
(re.compile(r'([\x00-\x7f]+)'), ' \\1 ')
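Each of these constants is a (compiled regex, substitution) pair, and INTERNATIONAL_REGEXES / LANG_DEPENDENT_REGEXES group them into lists. A minimal sketch of how such a list is presumably applied during tokenization (the loop itself is an assumption; NONASCII's value is copied from above):

import re

NONASCII = re.compile('([\x00-\x7f]+)'), ' \\1 '  # pad ASCII runs with spaces

def apply_regexes(text, regexes):
    # Run every (regexp, substitution) pair over the text in order.
    for regexp, substitution in regexes:
        text = regexp.sub(substitution, text)
    return text

tokens = apply_regexes('abc阿里巴巴xyz', [NONASCII]).split()
print(tokens)  # ['abc', '阿里巴巴', 'xyz']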
PERIOD_COMMA_FOLLOW

Undocumented

Value
(re.compile(r'([\.,])([^0-9])'), ' \\1 \\2')
PERIOD_COMMA_PRECEED

Undocumented

Value
(re.compile(r'([^0-9])([\.,])'), '\\1 \\2 ')

PUNCT

Undocumented

Value
(re.compile(r'([\{-~\[-` -&\(-\+:-@/])'), ' \\1 ')

PUNCT_1

Undocumented

Value
(re.compile("""([{n}])([{p}])""".format(n=number_regex, p=punct_regex)),
 '\\1 \\2 ')

PUNCT_2

Undocumented

Value
(re.compile("""([{p}])([{n}])""".format(n=number_regex, p=punct_regex)),
 ' \\1 \\2')
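PUNCT_1 and PUNCT_2 are built by interpolating the number and punctuation character classes (the undocumented class variables number_regex and punct_regex) into a template. A toy reconstruction with narrow ASCII stand-ins for those classes (the real classes cover full Unicode categories, so these stand-ins are assumptions):

import re

number_regex = '0-9'               # stand-in for the Unicode number class
punct_regex = re.escape('.,:;!?')  # stand-in for the Unicode punctuation class

# Same construction as above: pad punctuation that directly follows
# (PUNCT_1) or precedes (PUNCT_2) a number character.
PUNCT_1 = re.compile('([{n}])([{p}])'.format(n=number_regex, p=punct_regex)), '\\1 \\2 '
PUNCT_2 = re.compile('([{p}])([{n}])'.format(n=number_regex, p=punct_regex)), ' \\1 \\2'

s = 'price:3,14'
for regexp, substitution in (PUNCT_1, PUNCT_2):
    s = regexp.sub(substitution, s)
print(s)  # price : 3 , 14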
STRIP_EOL_HYPHEN

Undocumented

Value
(re.compile(r'\u2028'), ' ')
STRIP_SKIP

Undocumented

Value
(re.compile(r'<skipped>'), '')

SYMBOLS

Undocumented

Value
(re.compile("""([{s}])""".format(s=symbol_regex)), ' \\1 ')
number_regex

Undocumented

punct_regex

Undocumented

pup_number

Undocumented

pup_punct

Undocumented

pup_symbol

Undocumented

symbol_regex

Undocumented