class documentation
class LegalitySyllableTokenizer(TokenizerI):
Constructor: LegalitySyllableTokenizer(tokenized_source_text, vowels, legal_frequency_threshold)
Syllabifies words based on the Legality Principle and Onset Maximization.
>>> from nltk.tokenize import LegalitySyllableTokenizer
>>> from nltk import word_tokenize
>>> from nltk.corpus import words
>>> text = "This is a wonderful sentence."
>>> text_words = word_tokenize(text)
>>> LP = LegalitySyllableTokenizer(words.words())
>>> [LP.tokenize(word) for word in text_words]
[['This'], ['is'], ['a'], ['won', 'der', 'ful'], ['sen', 'ten', 'ce'], ['.']]
Method | __init__ | Initializes the tokenizer with a tokenized source text, the language's vowels, and a legal onset frequency threshold. |
Method | find_legal_onsets | Gathers all onsets and then returns only those above the frequency threshold. |
Method | onset | Returns the consonant cluster of a word, i.e. all characters up to the first vowel. |
Method | tokenize |
Apply the Legality Principle in combination with Onset Maximization to return a list of syllables. |
Instance Variable | legal_frequency_threshold | Lowest frequency an onset must have to be considered legal. |
Instance Variable | legal_onsets | Set of legal onsets gathered from the source text. |
Instance Variable | vowels | Valid vowels in the language (or an IPA representation). |
Inherited from TokenizerI:
Method | span_tokenize | Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token. |
Method | span_tokenize_sents | Apply self.span_tokenize() to each element of strings, i.e. return [self.span_tokenize(s) for s in strings]. |
Method | tokenize_sents | Apply self.tokenize() to each element of strings, i.e. return [self.tokenize(s) for s in strings]. |
def __init__(self, tokenized_source_text, vowels='aeiouy', legal_frequency_threshold=0.001):
Parameters | |
tokenized_source_text | List of valid tokens in the language |
vowels:str | Valid vowels in the language or an IPA representation |
legal_frequency_threshold | Lowest frequency an onset must have to be considered a legal onset |
Gathers all onsets and then returns only those above the frequency threshold.
Parameters | |
words:list(str) | List of words in a language |
Returns | |
set(str) | Set of legal onsets |
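The frequency filter described above can be sketched as a standalone function. This is an illustrative reimplementation, not the NLTK internals: the function name, the inline onset helper, and the tiny word list are assumptions for demonstration, and the real method operates on the tokenized source text passed to the constructor.

```python
from collections import Counter

def find_legal_onsets(words, vowels="aeiouy", threshold=0.001):
    """Keep only onsets whose relative frequency exceeds the threshold
    (hypothetical sketch of the method documented above)."""
    def onset(word):
        # All characters up to the first vowel.
        for i, ch in enumerate(word.lower()):
            if ch in vowels:
                return word.lower()[:i]
        return word.lower()

    counts = Counter(onset(w) for w in words)
    total = sum(counts.values())
    return {o for o, c in counts.items() if c / total > threshold}
```

For example, with a high threshold only the most common onset survives: `find_legal_onsets(["stop", "star", "stem", "apple", "top"], threshold=0.25)` returns `{"st"}`, since "st" accounts for 3 of the 5 onsets while "" and "t" each fall below the cutoff.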
Returns the consonant cluster of a word, i.e. all characters up to the first vowel.
Parameters | |
word:str | Single word or token |
Returns | |
str | String of characters of onset |
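As a standalone sketch of this behaviour (a hypothetical free function, not the NLTK method itself):

```python
def onset(word, vowels="aeiouy"):
    """Return all characters of the word up to the first vowel;
    a word with no vowels is its own onset."""
    word = word.lower()
    for i, ch in enumerate(word):
        if ch in vowels:
            return word[:i]
    return word
```

For instance, `onset("string")` yields `"str"`, and `onset("apple")` yields the empty string, which is itself a valid (empty) onset.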
overrides: nltk.tokenize.api.TokenizerI.tokenize
Apply the Legality Principle in combination with Onset Maximization to return a list of syllables.
Parameters | |
token:str | Single word or token |
Returns | |
list(str) | Single word or token broken up into syllables. |
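The combination of the Legality Principle and Onset Maximization can be sketched as follows. This is a simplified, self-contained illustration: the hard-coded legal_onsets set and the function name are assumptions for the example, whereas the real tokenizer derives its legal onsets from the source text given to the constructor.

```python
def syllabify(word, vowels="aeiouy", legal_onsets=None):
    """Split a word into syllables: between each pair of vowels, hand the
    following syllable the LONGEST consonant suffix that is a legal onset
    (Onset Maximization constrained by the Legality Principle)."""
    if legal_onsets is None:
        # Illustrative set; in practice this comes from find_legal_onsets().
        legal_onsets = {"", "b", "c", "d", "f", "g", "l", "n",
                        "r", "s", "t", "w", "st", "fr"}
    lowered = word.lower()
    vowel_idx = [i for i, ch in enumerate(lowered) if ch in vowels]
    if len(vowel_idx) < 2:
        return [word]  # at most one vowel: nothing to split

    syllables, start = [], 0
    for prev_v, next_v in zip(vowel_idx, vowel_idx[1:]):
        cluster = lowered[prev_v + 1:next_v]  # consonants between vowels
        # Find the longest legal suffix of the cluster to use as the onset.
        split = 0
        for k in range(len(cluster) + 1):
            if cluster[len(cluster) - k:] in legal_onsets:
                split = k
        boundary = next_v - split
        syllables.append(word[start:boundary])
        start = boundary
    syllables.append(word[start:])
    return syllables
```

With this toy onset set, `syllabify("wonderful")` produces `['won', 'der', 'ful']` and `syllabify("sentence")` produces `['sen', 'ten', 'ce']`, matching the class doctest above.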