class documentation
class SyllableTokenizer(TokenizerI):
Constructor: SyllableTokenizer(lang, sonority_hierarchy)
Syllabifies words based on the Sonority Sequencing Principle (SSP).
>>> from nltk.tokenize import SyllableTokenizer
>>> from nltk import word_tokenize
>>> SSP = SyllableTokenizer()
>>> SSP.tokenize('justification')
['jus', 'ti', 'fi', 'ca', 'tion']
>>> text = "This is a foobar-like sentence."
>>> [SSP.tokenize(token) for token in word_tokenize(text)]
[['This'], ['is'], ['a'], ['foo', 'bar', '-', 'li', 'ke'], ['sen', 'ten', 'ce'], ['.']]
| Method | __init__ | No summary |
| Method | assign_values | Assigns each phoneme its value from the sonority hierarchy. Note: Sentence/text has to be tokenized first. |
| Method | tokenize | Apply the SSP to return a list of syllables. Note: Sentence/text has to be tokenized first. |
| Method | validate_syllables | Ensures each syllable has at least one vowel. If the following syllable doesn't have a vowel, it is added to the current one. |
| Instance Variable | phoneme_map | Undocumented |
| Instance Variable | vowels | Undocumented |
Inherited from TokenizerI:
| Method | span_tokenize | Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token. |
| Method | span_tokenize_sents | Apply self.span_tokenize() to each element of strings. I.e.: return [self.span_tokenize(s) for s in strings] |
| Method | tokenize_sents | Apply self.tokenize() to each element of strings. I.e.: return [self.tokenize(s) for s in strings] |
| Parameters | |
| lang:str | Language parameter, default is English, 'en' |
| sonority_hierarchy | Sonority hierarchy according to the Sonority Sequencing Principle. |
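As a rough illustration of what a sonority hierarchy encodes, the sketch below turns a list of character classes, ordered from most to least sonorous, into a per-character sonority value. The helper name `build_sonority_map` and the four English character classes are assumptions made for this example; they are not specified on this page and the class may build its mapping differently.

```python
def build_sonority_map(hierarchy):
    """Map each character to a sonority level; earlier classes rank higher."""
    levels = {}
    for rank, chars in enumerate(hierarchy):
        level = len(hierarchy) - rank  # the first class gets the highest value
        for ch in chars:
            levels[ch] = level
            levels[ch.upper()] = level
    return levels


# Assumed English hierarchy, listed in descending sonority (illustrative only).
EXAMPLE_HIERARCHY = [
    "aeiouy",       # vowels
    "lmnrw",        # nasals and approximants
    "zvsf",         # fricatives
    "bcdgtkpqxhj",  # stops and remaining consonants
]

sonority = build_sonority_map(EXAMPLE_HIERARCHY)
print(sonority["a"], sonority["n"], sonority["s"], sonority["t"])  # prints: 4 3 2 1
```

A hierarchy for another language would presumably be passed in the same descending order, with the vowel class first.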
Assigns each phoneme its value from the sonority hierarchy. Note: Sentence/text has to be tokenized first.
| Parameters | |
| token:str | Single word or token |
| Returns | |
| list(tuple(str, int)) | List of tuples; the first element is the character/phoneme and the second is its sonority value. |
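The shape of the return value can be pictured with a small standalone sketch. The function `assign_sonority`, the partial `SONORITY` table, and the choice of -1 for characters missing from the table are all assumptions for illustration, not the library's code.

```python
# Hypothetical sonority values for a handful of letters (higher = more sonorous).
SONORITY = {
    "a": 4, "e": 4, "i": 4, "o": 4, "u": 4,
    "n": 3, "r": 3,
    "s": 2, "f": 2,
    "t": 1, "k": 1,
}


def assign_sonority(token, sonority=SONORITY, unknown=-1):
    """Pair each character of `token` with its sonority value."""
    # Characters missing from the table get `unknown` (an assumption of this sketch).
    return [(ch, sonority.get(ch.lower(), unknown)) for ch in token]


print(assign_sonority("test"))
# [('t', 1), ('e', 4), ('s', 2), ('t', 1)]
```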
overrides nltk.tokenize.api.TokenizerI.tokenize
Apply the SSP to return a list of syllables. Note: Sentence/text has to be tokenized first.
| Parameters | |
| token:str | Single word or token |
| Returns | |
| list(str) | Single word or token broken up into syllables. |
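To make the Sonority Sequencing Principle more concrete, here is a simplified, self-contained sketch: it walks over each character together with its neighbours, starts a new syllable at a local sonority minimum, and then merges any chunk that lacks a vowel into a neighbouring one. The names `syllabify` and `merge_vowelless`, the character classes, and the value 0 for unknown characters are assumptions of this sketch; punctuation handling is omitted, so results for tokens such as hyphenated words may differ from the class documented here.

```python
VOWELS = "aeiouy"
# Assumed sonority classes, most sonorous first (illustrative only).
CLASSES = [VOWELS, "lmnrw", "zvsf", "bcdgtkpqxhj"]
SONORITY = {ch: len(CLASSES) - i for i, cls in enumerate(CLASSES) for ch in cls}


def merge_vowelless(chunks):
    """Attach chunks that contain no vowel to a neighbouring chunk."""
    merged, pending = [], ""
    for chunk in chunks:
        if any(ch in VOWELS for ch in chunk.lower()):
            merged.append(pending + chunk)
            pending = ""
        elif merged:
            merged[-1] += chunk  # add the vowelless chunk to the current syllable
        else:
            pending += chunk  # leading consonants wait for the first vowel chunk
    if pending:
        merged.append(pending)
    return merged


def syllabify(word):
    """Split `word` at sonority minima (simplified SSP)."""
    lowered = word.lower()
    if sum(ch in VOWELS for ch in lowered) <= 1:
        return [word]  # zero or one vowel: nothing to split
    values = [SONORITY.get(ch, 0) for ch in lowered]  # unknown characters get 0
    chunks, current = [], word[0]
    for i in range(1, len(word) - 1):
        prev, focal, nxt = values[i - 1], values[i], values[i + 1]
        if prev >= focal == nxt:
            # Plateau after a fall: close the syllable after this character.
            chunks.append(current + word[i])
            current = ""
        elif prev > focal < nxt:
            # Local minimum: this character opens the next syllable.
            chunks.append(current)
            current = word[i]
        else:
            current += word[i]
    chunks.append(current + word[-1])
    return merge_vowelless(chunks)


print(syllabify("justification"))  # ['jus', 'ti', 'fi', 'ca', 'tion']
print(syllabify("sentence"))       # ['sen', 'ten', 'ce']
```

As the note above says, the input is expected to be a single word or token, so a sentence would be split into tokens before each one is syllabified.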