class documentation

class NISTTokenizer(TokenizerI):

This NIST tokenizer is sentence-based, unlike the original paragraph-based tokenization from mteval-14.pl; the sentence-based tokenization is consistent with the other tokenizers available in NLTK.

>>> from nltk.tokenize.nist import NISTTokenizer
>>> nist = NISTTokenizer()
>>> s = "Good muffins cost $3.88 in New York."
>>> expected_lower = [u'good', u'muffins', u'cost', u'$', u'3.88', u'in', u'new', u'york', u'.']
>>> expected_cased = [u'Good', u'muffins', u'cost', u'$', u'3.88', u'in', u'New', u'York', u'.']
>>> nist.tokenize(s, lowercase=False) == expected_cased
True
>>> nist.tokenize(s, lowercase=True) == expected_lower  # Lowercased.
True
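The tokenize() signature below also accepts a return_str flag. Assuming it returns the space-joined token string instead of a token list (an inference from the signature; this output is not shown on the original page), usage would look like:

>>> nist.tokenize(s, return_str=True)
'Good muffins cost $ 3.88 in New York .'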

international_tokenize() is the preferred method for tokenizing non-European text, e.g.:

>>> from nltk.tokenize.nist import NISTTokenizer
>>> nist = NISTTokenizer()

# Input strings.
>>> albb = u'Alibaba Group Holding Limited (Chinese: 阿里巴巴集团控股 有限公司) is a Chinese e-commerce company...'
>>> amz = u'Amazon.com, Inc. (/ˈæməzɒn/) is an American electronic commerce...'
>>> rkt = u'Rakuten, Inc. (楽天株式会社 Rakuten Kabushiki-gaisha) is a Japanese electronic commerce and Internet company based in Tokyo.'

# Expected tokens.
>>> expected_albb = [u'Alibaba', u'Group', u'Holding', u'Limited', u'(', u'Chinese', u':', u'阿里巴巴集团控股', u'有限公司', u')']
>>> expected_amz = [u'Amazon', u'.', u'com', u',', u'Inc', u'.', u'(', u'/', u'ˈæ', u'm']
>>> expected_rkt = [u'Rakuten', u',', u'Inc', u'.', u'(', u'楽天株式会社', u'Rakuten', u'Kabushiki', u'-', u'gaisha']

>>> nist.international_tokenize(albb)[:10] == expected_albb
True
>>> nist.international_tokenize(amz)[:10] == expected_amz
True
>>> nist.international_tokenize(rkt)[:10] == expected_rkt
True

# Doctest for patching issue #1926
>>> sent = u'this is a foo☄sentence.'
>>> expected_sent = [u'this', u'is', u'a', u'foo', u'☄', u'sentence', u'.']
>>> nist.international_tokenize(sent) == expected_sent
True

Method international_tokenize Undocumented
Method lang_independent_sub Performs the language-independent string substitutions.
Method tokenize Return a tokenized copy of text.
Constant DASH_PRECEED_DIGIT Undocumented
Constant INTERNATIONAL_REGEXES Undocumented
Constant LANG_DEPENDENT_REGEXES Undocumented
Constant NONASCII Undocumented
Constant PERIOD_COMMA_FOLLOW Undocumented
Constant PERIOD_COMMA_PRECEED Undocumented
Constant PUNCT Undocumented
Constant PUNCT_1 Undocumented
Constant PUNCT_2 Undocumented
Constant STRIP_EOL_HYPHEN Undocumented
Constant STRIP_SKIP Undocumented
Constant SYMBOLS Undocumented
Class Variable number_regex Undocumented
Class Variable punct_regex Undocumented
Class Variable pup_number Undocumented
Class Variable pup_punct Undocumented
Class Variable pup_symbol Undocumented
Class Variable symbol_regex Undocumented

Inherited from TokenizerI:

Method span_tokenize Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.
Method span_tokenize_sents Apply self.span_tokenize() to each element of strings. I.e.: return [self.span_tokenize(s) for s in strings]
Method tokenize_sents Apply self.tokenize() to each element of strings. I.e.: return [self.tokenize(s) for s in strings]
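For instance, a minimal check of the inherited batch helper, assuming tokenize_sents() simply maps tokenize() over its input as described above:

>>> sents = ['Good muffins cost $3.88.', 'Please buy me two of them.']
>>> nist.tokenize_sents(sents) == [nist.tokenize(s) for s in sents]
True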
def international_tokenize(self, text, lowercase=False, split_non_ascii=True, return_str=False):

Undocumented

def lang_independent_sub(self, text):

Performs the language-independent string substitutions.
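The (pattern, replacement) constants documented below hint at what these substitutions are. A minimal sketch, assuming the method applies STRIP_SKIP and STRIP_EOL_HYPHEN in sequence (the exact set and order of substitutions is an assumption):

import re

# Stand-ins mirroring the documented constants: each is a
# (compiled_pattern, replacement) pair.
STRIP_SKIP = re.compile(r'<skipped>'), ''     # drop <skipped> markers
STRIP_EOL_HYPHEN = re.compile('\u2028'), ' '  # line separator -> space

def lang_independent_sub(text):
    # Apply each (regexp, substitution) pair in turn (assumed order).
    for regexp, substitution in (STRIP_SKIP, STRIP_EOL_HYPHEN):
        text = regexp.sub(substitution, text)
    return text

print(lang_independent_sub('foo<skipped>bar\u2028baz'))  # foobar baz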

def tokenize(self, text, lowercase=False, western_lang=True, return_str=False):

Return a tokenized copy of text.

Returns
list of str
DASH_PRECEED_DIGIT

Undocumented

Value
(re.compile(r'([0-9])(-)'), '\\1 \\2 ')
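An illustrative use of this (pattern, replacement) pair, showing how a dash that follows a digit gets padded with spaces (example added for clarity; not from the original page):

>>> import re
>>> regexp, substitution = re.compile(r'([0-9])(-)'), '\\1 \\2 '
>>> regexp.sub(substitution, '5-year')
'5 - year'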
INTERNATIONAL_REGEXES

Undocumented

Value
[NONASCII, PUNCT_1, PUNCT_2, SYMBOLS]

LANG_DEPENDENT_REGEXES

Undocumented

Value
[PUNCT, PERIOD_COMMA_PRECEED, PERIOD_COMMA_FOLLOW, DASH_PRECEED_DIGIT]
NONASCII

Undocumented

Value
(re.compile(r'([\x00-\x7f]+)'), ' \\1 ')
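Each of these constants is a (compiled regex, substitution) pair, and INTERNATIONAL_REGEXES / LANG_DEPENDENT_REGEXES group them into lists. A minimal sketch of how such a list is presumably applied during tokenization (the loop itself is an assumption; NONASCII's value is copied from above):

import re

NONASCII = re.compile('([\x00-\x7f]+)'), ' \\1 '  # pad ASCII runs with spaces

def apply_regexes(text, regexes):
    # Run every (regexp, substitution) pair over the text in order.
    for regexp, substitution in regexes:
        text = regexp.sub(substitution, text)
    return text

tokens = apply_regexes('abc阿里巴巴xyz', [NONASCII]).split()
print(tokens)  # ['abc', '阿里巴巴', 'xyz']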
PERIOD_COMMA_FOLLOW

Undocumented

Value
(re.compile(r'([\.,])([^0-9])'), ' \\1 \\2')
PERIOD_COMMA_PRECEED

Undocumented

Value
(re.compile(r'([^0-9])([\.,])'), '\\1 \\2 ')

PUNCT

Undocumented

Value
(re.compile(r'([\{-~\[-` -&\(-\+:-@/])'), ' \\1 ')

PUNCT_1

Undocumented

Value
(re.compile("""([{n}])([{p}])""".format(n=number_regex, p=punct_regex)),
 '\\1 \\2 ')

PUNCT_2

Undocumented

Value
(re.compile("""([{p}])([{n}])""".format(n=number_regex, p=punct_regex)),
 ' \\1 \\2')
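PUNCT_1 and PUNCT_2 are built by interpolating the number and punctuation character classes (the undocumented class variables number_regex and punct_regex) into a template. A toy reconstruction with narrow ASCII stand-ins for those classes (the real classes cover full Unicode categories, so these stand-ins are assumptions):

import re

number_regex = '0-9'               # stand-in for the Unicode number class
punct_regex = re.escape('.,:;!?')  # stand-in for the Unicode punctuation class

# Same construction as above: pad punctuation that directly follows
# (PUNCT_1) or precedes (PUNCT_2) a number character.
PUNCT_1 = re.compile('([{n}])([{p}])'.format(n=number_regex, p=punct_regex)), '\\1 \\2 '
PUNCT_2 = re.compile('([{p}])([{n}])'.format(n=number_regex, p=punct_regex)), ' \\1 \\2'

s = 'price:3,14'
for regexp, substitution in (PUNCT_1, PUNCT_2):
    s = regexp.sub(substitution, s)
print(s)  # price : 3 , 14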
STRIP_EOL_HYPHEN

Undocumented

Value
(re.compile(r'\u2028'), ' ')
STRIP_SKIP

Undocumented

Value
(re.compile(r'<skipped>'), '')

SYMBOLS

Undocumented

Value
(re.compile("""([{s}])""".format(s=symbol_regex)), ' \\1 ')
number_regex

Undocumented

punct_regex

Undocumented

pup_number

Undocumented

pup_punct

Undocumented

pup_symbol

Undocumented

symbol_regex

Undocumented