class documentation

A corpus reader used to access WordNet or its variants.

Method __init__ Construct a new wordnet corpus reader, with the given root directory.
Method all_lemma_names Return all lemma names for all synsets for the given part of speech tag and language or languages. If pos is not specified, all synsets for all parts of speech will be used.
Method all_synsets Iterate over all synsets with a given part of speech tag. If no pos is specified, all synsets for all parts of speech will be loaded.
Method citation Return the contents of the citation.bib file (for OMW); use lang=lang to get the citation for an individual language.
Method custom_lemmas Reads a custom tab file containing mappings of lemmas in the given language to Princeton WordNet 3.0 synset offsets, allowing NLTK's WordNet functions to then be used with that language.
Method get_version Undocumented
Method ic Creates an information content lookup dictionary from a corpus.
Method jcn_similarity Undocumented
Method langs Return a list of languages supported by the Multilingual Wordnet.
Method lch_similarity Undocumented
Method lemma Return the Lemma object that matches the name.
Method lemma_count Return the frequency count for this Lemma.
Method lemma_from_key Undocumented
Method lemmas Return all Lemma objects with a name matching the specified lemma name and part of speech tag. Matches any part of speech tag if none is specified.
Method license Return the contents of LICENSE (for OMW); use lang=lang to get the license for an individual language.
Method lin_similarity Undocumented
Method morphy Find a possible base form for the given form, with the given part of speech, by checking WordNet's list of exceptional forms, and by recursively stripping affixes for this part of speech until a form in WordNet is found.
Method of2ss Take a WordNet offset ID and return the corresponding synset.
Method path_similarity Undocumented
Method readme Return the contents of README (for OMW); use lang=lang to get the readme for an individual language.
Method res_similarity Undocumented
Method ss2of Return the WordNet ID of the given synset.
Method synset Undocumented
Method synset_from_pos_and_offset pos: The synset's part of speech, matching one of the module level attributes ADJ, ADJ_SAT, ADV, NOUN or VERB ('a', 's', 'r', 'n', or 'v').
Method synset_from_sense_key Retrieve the synset for a given sense_key. Sense keys can be obtained from lemma.key().
Method synsets Load all synsets with a given lemma and part of speech tag. If no pos is specified, all synsets for all parts of speech will be loaded. If lang is specified, all the synsets associated with the lemma name of that language will be returned.
Method words Return the lemmas of the given language as a list of words.
Method wup_similarity Undocumented
Constant MORPHOLOGICAL_SUBSTITUTIONS Undocumented
Class Variable ADJ Undocumented
Class Variable ADJ_SAT Undocumented
Class Variable ADV Undocumented
Class Variable NOUN Undocumented
Class Variable VERB Undocumented
Method _compute_max_depth Compute the max depth for the given part of speech. This is used by the lch similarity metric.
Method _data_file Return an open file pointer for the data file for the given part of speech.
Method _load_exception_map Undocumented
Method _load_lang_data Load the WordNet data of the requested language from file into the cache _lang_data.
Method _load_lemma_pos_offset_map Undocumented
Method _morphy Undocumented
Method _synset_from_pos_and_line Undocumented
Method _synset_from_pos_and_offset Hack to help people like the readers of http://stackoverflow.com/a/27145655/1709587 who were using this function before it was officially a public method
Constant _ENCODING Undocumented
Constant _FILEMAP Undocumented
Constant _FILES Undocumented
Class Variable _pos_names Undocumented
Class Variable _pos_numbers Undocumented
Instance Variable _data_file_map Undocumented
Instance Variable _exception_map Undocumented
Instance Variable _key_count_file Undocumented
Instance Variable _key_synset_file Undocumented
Instance Variable _lang_data Undocumented
Instance Variable _lemma_pos_offset_map Undocumented
Instance Variable _lexnames Undocumented
Instance Variable _max_depth Undocumented
Instance Variable _omw_reader Undocumented
Instance Variable _synset_offset_cache Undocumented

Inherited from CorpusReader:

Method __repr__ Undocumented
Method abspath Return the absolute path for the given file.
Method abspaths Return a list of the absolute paths for all fileids in this corpus; or for the given list of fileids, if specified.
Method encoding Return the unicode encoding for the given corpus file, if known. If the encoding is unknown, or if the given file should be processed using byte strings (bytes), then return None.
Method ensure_loaded Load this corpus (if it has not already been loaded). This is used by LazyCorpusLoader as a simple method that can be used to make sure a corpus is loaded -- e.g., in case a user wants to do help(some_corpus).
Method fileids Return a list of file identifiers for the fileids that make up this corpus.
Method open Return an open stream that can be used to read the given file. If the file's encoding is not None, then the stream will automatically decode the file's contents into unicode.
Class Variable root Undocumented
Method _get_root Undocumented
Instance Variable _encoding The default unicode encoding for the fileids that make up this corpus. If encoding is None, then the file contents are processed using byte strings.
Instance Variable _fileids A list of the relative paths for the fileids that make up this corpus.
Instance Variable _root The root directory for this corpus.
Instance Variable _tagset Undocumented
def __init__(self, root, omw_reader): (source)

Construct a new wordnet corpus reader, with the given root directory.

def all_lemma_names(self, pos=None, lang='eng'): (source)

Return all lemma names for all synsets for the given part of speech tag and language or languages. If pos is not specified, all synsets for all parts of speech will be used.

def all_synsets(self, pos=None): (source)

Iterate over all synsets with a given part of speech tag. If no pos is specified, all synsets for all parts of speech will be loaded.

def citation(self, lang='omw'): (source)

Return the contents of the citation.bib file (for OMW); use lang=lang to get the citation for an individual language.

def custom_lemmas(self, tab_file, lang): (source)

Reads a custom tab file containing mappings of lemmas in the given language to Princeton WordNet 3.0 synset offsets, allowing NLTK's WordNet functions to then be used with that language.

See the "Tab files" section at http://compling.hss.ntu.edu.sg/omw/ for documentation on the Multilingual WordNet tab file format.

Parameters
tab_file	Tab file as a file or file-like object
lang	ISO 639-3 code of the language of the tab file
def get_version(self): (source)

Undocumented

def ic(self, corpus, weight_senses_equally=False, smoothing=1.0): (source)

Creates an information content lookup dictionary from a corpus.

Parameters
corpus:CorpusReader	The corpus from which we create an information content dictionary
weight_senses_equally:bool	If this is True, gives all possible senses equal weight rather than dividing by the number of possible senses. (If a word has 3 senses, each sense gets 0.3333 per appearance when this is False, 1.0 when it is True.)
smoothing:float	How much do we smooth synset counts (default is 1.0)
Returns	An information content dictionary
def jcn_similarity(self, synset1, synset2, ic, verbose=False): (source)

Undocumented

def langs(self): (source)

Return a list of languages supported by the Multilingual Wordnet.

def lch_similarity(self, synset1, synset2, verbose=False, simulate_root=True): (source)

Undocumented

def lemma(self, name, lang='eng'): (source)

Return the Lemma object that matches the name.

def lemma_count(self, lemma): (source)

Return the frequency count for this Lemma

def lemma_from_key(self, key): (source)

Undocumented

def lemmas(self, lemma, pos=None, lang='eng'): (source)

Return all Lemma objects with a name matching the specified lemma name and part of speech tag. Matches any part of speech tag if none is specified.

def license(self, lang='eng'): (source)

Return the contents of LICENSE (for OMW); use lang=lang to get the license for an individual language.

def lin_similarity(self, synset1, synset2, ic, verbose=False): (source)

Undocumented

def morphy(self, form, pos=None, check_exceptions=True): (source)

Find a possible base form for the given form, with the given part of speech, by checking WordNet's list of exceptional forms, and by recursively stripping affixes for this part of speech until a form in WordNet is found.

>>> from nltk.corpus import wordnet as wn
>>> print(wn.morphy('dogs'))
dog
>>> print(wn.morphy('churches'))
church
>>> print(wn.morphy('aardwolves'))
aardwolf
>>> print(wn.morphy('abaci'))
abacus
>>> wn.morphy('hardrock', wn.ADV)
>>> print(wn.morphy('book', wn.NOUN))
book
>>> wn.morphy('book', wn.ADJ)
def of2ss(self, of): (source)

Take a WordNet offset ID and return the corresponding synset.

def path_similarity(self, synset1, synset2, verbose=False, simulate_root=True): (source)

Undocumented

def readme(self, lang='omw'): (source)

Return the contents of README (for OMW); use lang=lang to get the readme for an individual language.

def res_similarity(self, synset1, synset2, ic, verbose=False): (source)

Undocumented

def ss2of(self, ss, lang=None): (source)

Return the WordNet ID of the given synset.

def synset(self, name): (source)

Undocumented

def synset_from_pos_and_offset(self, pos, offset): (source)

  • pos: The synset's part of speech, matching one of the module level attributes ADJ, ADJ_SAT, ADV, NOUN or VERB ('a', 's', 'r', 'n', or 'v').
  • offset: The byte offset of this synset in the WordNet dict file for this pos.
>>> from nltk.corpus import wordnet as wn
>>> print(wn.synset_from_pos_and_offset('n', 1740))
Synset('entity.n.01')

def synset_from_sense_key(self, sense_key): (source)

Retrieve the synset for a given sense_key. Sense keys can be obtained from lemma.key().

From https://wordnet.princeton.edu/documentation/senseidx5wn: A sense_key is represented as:

lemma%lex_sense (e.g. 'dog%1:18:01::')

where lex_sense is encoded as:

ss_type:lex_filenum:lex_id:head_word:head_id

lemma: ASCII text of the word or collocation, in lower case
ss_type: synset type for the sense (1 digit int). The synset type is encoded as follows:
  1	NOUN
  2	VERB
  3	ADJECTIVE
  4	ADVERB
  5	ADJECTIVE SATELLITE
lex_filenum: name of the lexicographer file containing the synset for the sense (2 digit int)
lex_id: when paired with lemma, uniquely identifies a sense in the lexicographer file (2 digit int)
head_word: lemma of the first word in the satellite's head synset; only used if the sense is in an adjective satellite synset
head_id: uniquely identifies the sense in a lexicographer file when paired with head_word (2 digit int); only used if head_word is present

>>> import nltk
>>> from nltk.corpus import wordnet as wn
>>> print(wn.synset_from_sense_key("drive%1:04:03::"))
Synset('drive.n.06')
>>> print(wn.synset_from_sense_key("driving%1:04:03::"))
Synset('drive.n.06')
def synsets(self, lemma, pos=None, lang='eng', check_exceptions=True): (source)

Load all synsets with a given lemma and part of speech tag. If no pos is specified, all synsets for all parts of speech will be loaded. If lang is specified, all the synsets associated with the lemma name of that language will be returned.

def words(self, lang='eng'): (source)

Return the lemmas of the given language as a list of words.

def wup_similarity(self, synset1, synset2, verbose=False, simulate_root=True): (source)

Undocumented

MORPHOLOGICAL_SUBSTITUTIONS = (source)

Undocumented

Value
{NOUN: [('s', ''),
        ('ses', 's'),
        ('ves', 'f'),
        ('xes', 'x'),
        ('zes', 'z'),
        ('ches', 'ch'),
        ('shes', 'sh'),
...

ADJ = (source)

Undocumented

ADJ_SAT = (source)

Undocumented

ADV = (source)

Undocumented

NOUN = (source)

Undocumented

VERB = (source)

Undocumented

def _compute_max_depth(self, pos, simulate_root): (source)

Compute the max depth for the given part of speech. This is used by the lch similarity metric.

def _data_file(self, pos): (source)

Return an open file pointer for the data file for the given part of speech.

def _load_exception_map(self): (source)

Undocumented

def _load_lang_data(self, lang): (source)

Load the WordNet data of the requested language from file into the cache _lang_data.

def _load_lemma_pos_offset_map(self): (source)

Undocumented

def _morphy(self, form, pos, check_exceptions=True): (source)

Undocumented

def _synset_from_pos_and_line(self, pos, data_file_line): (source)

Undocumented

@deprecated('Use public method synset_from_pos_and_offset() instead')
def _synset_from_pos_and_offset(self, *args, **kwargs): (source)

Hack to help people like the readers of http://stackoverflow.com/a/27145655/1709587 who were using this function before it was officially a public method

_ENCODING: str = (source)

Undocumented

Value
'utf8'
_FILEMAP = (source)

Undocumented

Value
{ADJ: 'adj', ADV: 'adv', NOUN: 'noun', VERB: 'verb'}
_FILES: tuple[str, ...] = (source)

Undocumented

Value
('cntlist.rev',
 'lexnames',
 'index.sense',
 'index.adj',
 'index.adv',
 'index.noun',
 'index.verb',
...
_pos_names = (source)

Undocumented

_pos_numbers = (source)

Undocumented

_data_file_map: dict = (source)

Undocumented

_exception_map: dict = (source)

Undocumented

_key_count_file = (source)

Undocumented

_key_synset_file = (source)

Undocumented

_lang_data = (source)

Undocumented

_lemma_pos_offset_map = (source)

Undocumented

_lexnames: list = (source)

Undocumented

_max_depth = (source)

Undocumented

_omw_reader = (source)

Undocumented

_synset_offset_cache = (source)

Undocumented