class documentation
class LegalitySyllableTokenizer(TokenizerI):
Constructor: LegalitySyllableTokenizer(tokenized_source_text, vowels, legal_frequency_threshold)
Syllabifies words based on the Legality Principle and Onset Maximization.
>>> from nltk.tokenize import LegalitySyllableTokenizer
>>> from nltk import word_tokenize
>>> from nltk.corpus import words
>>> text = "This is a wonderful sentence."
>>> text_words = word_tokenize(text)
>>> LP = LegalitySyllableTokenizer(words.words())
>>> [LP.tokenize(word) for word in text_words]
[['This'], ['is'], ['a'], ['won', 'der', 'ful'], ['sen', 'ten', 'ce'], ['.']]
Method | __init__ | Initializes the tokenizer with a tokenized source text, the language's vowels, and a legal onset frequency threshold. |
Method | find_legal_onsets | Gathers all onsets and then returns only those above the frequency threshold. |
Method | onset | Returns the consonant cluster of a word, i.e. all characters up to the first vowel. |
Method | tokenize |
Apply the Legality Principle in combination with Onset Maximization to return a list of syllables. |
Instance Variable | legal_frequency_threshold | Lowest frequency an onset must have to be considered legal. |
Instance Variable | legal_onsets | Set of legal onsets gathered from the source text. |
Instance Variable | vowels | Valid vowels in the language (or an IPA representation). |
Inherited from TokenizerI:
Method | span_tokenize | Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token. |
Method | span_tokenize_sents | Apply self.span_tokenize() to each element of strings, i.e. return [self.span_tokenize(s) for s in strings]. |
Method | tokenize_sents | Apply self.tokenize() to each element of strings, i.e. return [self.tokenize(s) for s in strings]. |
def __init__(self, tokenized_source_text, vowels='aeiouy', legal_frequency_threshold=0.001):
Parameters | |
tokenized_source_text | List of valid tokens in the language |
vowels:str | Valid vowels in the language or an IPA representation |
legal_frequency_threshold | Lowest frequency an onset must have to be considered a legal onset |
Gathers all onsets and then returns only those above the frequency threshold.
Parameters | |
words:list(str) | List of words in a language |
Returns | |
set(str) | Set of legal onsets |
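The frequency filter described above can be sketched as a standalone function. This is an illustrative reimplementation, not the NLTK internals: the function name, the inline onset helper, and the tiny word list are assumptions for demonstration, and the real method operates on the tokenized source text passed to the constructor.

```python
from collections import Counter

def find_legal_onsets(words, vowels="aeiouy", threshold=0.001):
    """Keep only onsets whose relative frequency exceeds the threshold
    (hypothetical sketch of the method documented above)."""
    def onset(word):
        # All characters up to the first vowel.
        for i, ch in enumerate(word.lower()):
            if ch in vowels:
                return word.lower()[:i]
        return word.lower()

    counts = Counter(onset(w) for w in words)
    total = sum(counts.values())
    return {o for o, c in counts.items() if c / total > threshold}
```

For example, with a high threshold only the most common onset survives: `find_legal_onsets(["stop", "star", "stem", "apple", "top"], threshold=0.25)` returns `{"st"}`, since "st" accounts for 3 of the 5 onsets while "" and "t" each fall below the cutoff.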
Returns the consonant cluster of a word, i.e. all characters up to the first vowel.
Parameters | |
word:str | Single word or token |
Returns | |
str | String of characters of onset |
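As a standalone sketch of this behaviour (a hypothetical free function, not the NLTK method itself):

```python
def onset(word, vowels="aeiouy"):
    """Return all characters of the word up to the first vowel;
    a word with no vowels is its own onset."""
    word = word.lower()
    for i, ch in enumerate(word):
        if ch in vowels:
            return word[:i]
    return word
```

For instance, `onset("string")` yields `"str"`, and `onset("apple")` yields the empty string, which is itself a valid (empty) onset.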
overrides: nltk.tokenize.api.TokenizerI.tokenize
Apply the Legality Principle in combination with Onset Maximization to return a list of syllables.
Parameters | |
token:str | Single word or token |
Returns | |
list(str) | Single word or token broken up into syllables. |
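The combination of the Legality Principle and Onset Maximization can be sketched as follows. This is a simplified, self-contained illustration: the hard-coded legal_onsets set and the function name are assumptions for the example, whereas the real tokenizer derives its legal onsets from the source text given to the constructor.

```python
def syllabify(word, vowels="aeiouy", legal_onsets=None):
    """Split a word into syllables: between each pair of vowels, hand the
    following syllable the LONGEST consonant suffix that is a legal onset
    (Onset Maximization constrained by the Legality Principle)."""
    if legal_onsets is None:
        # Illustrative set; in practice this comes from find_legal_onsets().
        legal_onsets = {"", "b", "c", "d", "f", "g", "l", "n",
                        "r", "s", "t", "w", "st", "fr"}
    lowered = word.lower()
    vowel_idx = [i for i, ch in enumerate(lowered) if ch in vowels]
    if len(vowel_idx) < 2:
        return [word]  # at most one vowel: nothing to split

    syllables, start = [], 0
    for prev_v, next_v in zip(vowel_idx, vowel_idx[1:]):
        cluster = lowered[prev_v + 1:next_v]  # consonants between vowels
        # Find the longest legal suffix of the cluster to use as the onset.
        split = 0
        for k in range(len(cluster) + 1):
            if cluster[len(cluster) - k:] in legal_onsets:
                split = k
        boundary = next_v - split
        syllables.append(word[start:boundary])
        start = boundary
    syllables.append(word[start:])
    return syllables
```

With this toy onset set, `syllabify("wonderful")` produces `['won', 'der', 'ful']` and `syllabify("sentence")` produces `['sen', 'ten', 'ce']`, matching the class doctest above.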