class documentation
class LegalitySyllableTokenizer(TokenizerI): (source)
Constructor: LegalitySyllableTokenizer(tokenized_source_text, vowels, legal_frequency_threshold)
Syllabifies words based on the Legality Principle and Onset Maximization.
>>> from nltk.tokenize import LegalitySyllableTokenizer
>>> from nltk import word_tokenize
>>> from nltk.corpus import words
>>> text = "This is a wonderful sentence."
>>> text_words = word_tokenize(text)
>>> LP = LegalitySyllableTokenizer(words.words())
>>> [LP.tokenize(word) for word in text_words]
[['This'], ['is'], ['a'], ['won', 'der', 'ful'], ['sen', 'ten', 'ce'], ['.']]
| Method | __init__ |
No summary |
| Method | find_legal_onsets |
Gathers all onsets and returns only those above the frequency threshold |
| Method | onset |
Returns consonant cluster of word, i.e. all characters until the first vowel. |
| Method | tokenize |
Apply the Legality Principle in combination with Onset Maximization to return a list of syllables. |
| Instance Variable | legal_frequency_threshold |
Undocumented |
| Instance Variable | legal_onsets |
Undocumented |
| Instance Variable | vowels |
Undocumented |
Inherited from TokenizerI:
| Method | span_tokenize |
Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token. |
| Method | span_tokenize_sents |
Apply self.span_tokenize() to each element of strings. |
| Method | tokenize_sents |
Apply self.tokenize() to each element of strings. |
def __init__(self, tokenized_source_text, vowels='aeiouy', legal_frequency_threshold=0.001):
(source)
| Parameters | |
| tokenized_source_text:list(str) | List of valid tokens in the language |
| vowels:str | Valid vowels in language or IPA representation |
| legal_frequency_threshold:float | Lowest frequency of all onsets to be considered a legal onset |
Gathers all onsets and returns only those above the frequency threshold.
| Parameters | |
| words:list(str) | List of words in a language |
| Returns | |
| set(str) | Set of legal onsets |
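The thresholding step can be sketched as follows. This is an illustrative reimplementation, not the NLTK source: the local `onset` helper and the `threshold` parameter name are assumptions made for the sketch.

```python
from collections import Counter

def onset(word, vowels="aeiouy"):
    # Illustrative helper: leading consonants up to the first vowel.
    prefix = ""
    for ch in word.lower():
        if ch in vowels:
            return prefix
        prefix += ch
    return prefix

def find_legal_onsets(words, vowels="aeiouy", threshold=0.001):
    # Count every word-initial onset, then keep only those whose
    # relative frequency in the word list exceeds the threshold.
    onsets = [onset(w, vowels) for w in words]
    counts = Counter(onsets)
    total = len(onsets)
    return {o for o, c in counts.items() if c / total > threshold}
```

Note that vowel-initial words contribute an empty onset, which is itself a (very common, hence legal) onset.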
Returns consonant cluster of word, i.e. all characters until the first vowel.
| Parameters | |
| word:str | Single word or token |
| Returns | |
| str | String of characters of onset |
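A minimal sketch of the onset computation, as an illustrative reimplementation rather than the NLTK source:

```python
def onset(word, vowels="aeiouy"):
    # Collect characters until the first vowel is reached.
    prefix = ""
    for ch in word.lower():
        if ch in vowels:
            return prefix
        prefix += ch
    return prefix
```

For example, the onset of "string" is "str", while a vowel-initial word such as "apple" yields the empty string.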
overrides
nltk.tokenize.api.TokenizerI.tokenize
Apply the Legality Principle in combination with Onset Maximization to return a list of syllables.
| Parameters | |
| token:str | Single word or token |
| Returns | |
| list(str) | Single word or token broken up into syllables. |
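The right-to-left scan behind this method can be sketched roughly as below. This is a simplified illustration of the Legality Principle combined with Onset Maximization, not the NLTK source; `legal_onsets` is assumed to come from a step like `find_legal_onsets`.

```python
def tokenize(token, legal_onsets, vowels="aeiouy"):
    # Scan the word right to left. Grow the current syllable until a
    # vowel is seen, then keep prepending consonants as long as the
    # maximized onset stays legal; otherwise start a new syllable.
    syllables = []
    syllable, current_onset = "", ""
    vowel, onset = False, False
    for char in token[::-1]:
        c = char.lower()
        if not vowel:
            syllable += char
            vowel = c in vowels
        elif c + current_onset[::-1] in legal_onsets:
            # Extending the onset is still legal: maximize it.
            syllable += char
            current_onset += c
            onset = True
        elif c in vowels and not onset:
            # Adjacent vowel before any onset: keep it in this syllable.
            syllable += char
            current_onset += c
        else:
            # Illegal extension: close this syllable, start a new one.
            syllables.append(syllable)
            syllable, current_onset = char, ""
            vowel, onset = c in vowels, False
    syllables.append(syllable)
    # Characters were collected in reverse; restore their order.
    return [s[::-1] for s in syllables][::-1]
```

With a legal-onset set containing "w", "d", and "f", this sketch splits "wonderful" into ['won', 'der', 'ful'], matching the doctest above.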