nltk.tokenize.sonority_sequencing.SyllableTokenizer

class documentation

class SyllableTokenizer(TokenizerI): (source)

Constructor: SyllableTokenizer(lang, sonority_hierarchy)

Syllabifies words based on the Sonority Sequencing Principle (SSP).

>>> from nltk.tokenize import SyllableTokenizer
>>> from nltk import word_tokenize
>>> SSP = SyllableTokenizer()
>>> SSP.tokenize('justification')
['jus', 'ti', 'fi', 'ca', 'tion']
>>> text = "This is a foobar-like sentence."
>>> [SSP.tokenize(token) for token in word_tokenize(text)]
[['This'], ['is'], ['a'], ['foo', 'bar', '-', 'li', 'ke'], ['sen', 'ten', 'ce'], ['.']]

Method	`__init__`	No summary
Method	`assign_values`	Assigns each phoneme its value from the sonority hierarchy. Note: Sentence/text has to be tokenized first.
Method	`tokenize`	Apply the SSP to return a list of syllables. Note: Sentence/text has to be tokenized first.
Method	`validate_syllables`	Ensures each syllable has at least one vowel. If the following syllable doesn't have vowel, add it to the current one.
Instance Variable	`phoneme_map`	Undocumented
Instance Variable	`vowels`	Undocumented

Inherited from TokenizerI:

Method	`span_tokenize`	Identify the tokens using integer offsets `(start_i, end_i)`, where `s[start_i:end_i]` is the corresponding token.
Method	`span_tokenize_sents`	Apply `self.span_tokenize()` to each element of `strings`. I.e.:
Method	`tokenize_sents`	Apply `self.tokenize()` to each element of `strings`. I.e.:

def __init__(self, lang='en', sonority_hierarchy=False): (source) ¶

Parameters
lang:str	Language parameter, default is English, 'en'
sonority_hierarchy:list(str)	Sonority hierarchy according to the Sonority Sequencing Principle.

def assign_values(self, token): (source) ¶

Assigns each phoneme its value from the sonority hierarchy. Note: Sentence/text has to be tokenized first.

Parameters
token:str	Single word or token
Returns
list(tuple(str, int))	List of tuples, first element is character/phoneme and second is the soronity value.

def tokenize(self, token): (source) ¶

overrides nltk.tokenize.api.TokenizerI.tokenize

Apply the SSP to return a list of syllables. Note: Sentence/text has to be tokenized first.

Parameters
token:str	Single word or token
Returns
list(str)	Single word or token broken up into syllables.

def validate_syllables(self, syllable_list): (source) ¶

Ensures each syllable has at least one vowel. If the following syllable doesn't have vowel, add it to the current one.

Parameters
syllable_list:list(str)	Single word or token broken up into syllables.
Returns
list(str)	Single word or token broken up into syllables (with added syllables if necessary)

phoneme_map: dict = (source) ¶

Undocumented

vowels = (source) ¶

Undocumented