class documentation

A corpus reader used to access WordNet or its variants.

Method __init__ Construct a new wordnet corpus reader, with the given root directory.
Method all_lemma_names Return all lemma names for all synsets for the given part of speech tag and language or languages. If pos is not specified, all synsets for all parts of speech will be used.
Method all_synsets Iterate over all synsets with a given part of speech tag. If no pos is specified, all synsets for all parts of speech will be loaded.
Method citation Return the contents of the citation.bib file (for OMW); use lang=lang to get the citation for an individual language.
Method custom_lemmas Reads a custom tab file containing mappings of lemmas in the given language to Princeton WordNet 3.0 synset offsets, allowing NLTK's WordNet functions to then be used with that language.
Method get_version Undocumented
Method ic Creates an information content lookup dictionary from a corpus.
Method jcn_similarity Undocumented
Method langs Return a list of languages supported by the Multilingual Wordnet.
Method lch_similarity Undocumented
Method lemma Return the Lemma object that matches the name.
Method lemma_count Return the frequency count for this Lemma.
Method lemma_from_key Undocumented
Method lemmas Return all Lemma objects with a name matching the specified lemma name and part of speech tag. Matches any part of speech tag if none is specified.
Method license Return the contents of LICENSE (for OMW); use lang=lang to get the license for an individual language.
Method lin_similarity Undocumented
Method morphy Find a possible base form for the given form, with the given part of speech, by checking WordNet's list of exceptional forms, and by recursively stripping affixes for this part of speech until a form in WordNet is found.
Method of2ss Take a WordNet offset ID and return the corresponding synset.
Method path_similarity Undocumented
Method readme Return the contents of README (for OMW); use lang=lang to get the readme for an individual language.
Method res_similarity Undocumented
Method ss2of Return the WordNet ID of the given synset.
Method synset Undocumented
Method synset_from_pos_and_offset pos: The synset's part of speech, matching one of the module level attributes ADJ, ADJ_SAT, ADV, NOUN or VERB ('a', 's', 'r', 'n', or 'v').
Method synset_from_sense_key Retrieve the synset for a given sense_key. Sense keys can be obtained from lemma.key().
Method synsets Load all synsets with a given lemma and part of speech tag. If no pos is specified, all synsets for all parts of speech will be loaded. If lang is specified, all the synsets associated with the lemma name of that language will be returned.
Method words Return the lemmas of the given language as a list of words.
Method wup_similarity Undocumented
Constant MORPHOLOGICAL_SUBSTITUTIONS Undocumented
Class Variable ADJ Undocumented
Class Variable ADJ_SAT Undocumented
Class Variable ADV Undocumented
Class Variable NOUN Undocumented
Class Variable VERB Undocumented
Method _compute_max_depth Compute the max depth for the given part of speech. This is used by the lch similarity metric.
Method _data_file Return an open file pointer for the data file for the given part of speech.
Method _load_exception_map Undocumented
Method _load_lang_data Load the WordNet data of the requested language from file into the cache _lang_data.
Method _load_lemma_pos_offset_map Undocumented
Method _morphy Undocumented
Method _synset_from_pos_and_line Undocumented
Method _synset_from_pos_and_offset Hack to help people like the readers of http://stackoverflow.com/a/27145655/1709587 who were using this function before it was officially a public method
Constant _ENCODING Undocumented
Constant _FILEMAP Undocumented
Constant _FILES Undocumented
Class Variable _pos_names Undocumented
Class Variable _pos_numbers Undocumented
Instance Variable _data_file_map Undocumented
Instance Variable _exception_map Undocumented
Instance Variable _key_count_file Undocumented
Instance Variable _key_synset_file Undocumented
Instance Variable _lang_data Undocumented
Instance Variable _lemma_pos_offset_map Undocumented
Instance Variable _lexnames Undocumented
Instance Variable _max_depth Undocumented
Instance Variable _omw_reader Undocumented
Instance Variable _synset_offset_cache Undocumented

Inherited from CorpusReader:

Method __repr__ Undocumented
Method abspath Return the absolute path for the given file.
Method abspaths Return a list of the absolute paths for all fileids in this corpus; or for the given list of fileids, if specified.
Method encoding Return the unicode encoding for the given corpus file, if known. If the encoding is unknown, or if the given file should be processed using byte strings (bytes), then return None.
Method ensure_loaded Load this corpus (if it has not already been loaded). This is used by LazyCorpusLoader as a simple method that can be used to make sure a corpus is loaded -- e.g., in case a user wants to do help(some_corpus).
Method fileids Return a list of file identifiers for the fileids that make up this corpus.
Method open Return an open stream that can be used to read the given file. If the file's encoding is not None, then the stream will automatically decode the file's contents into unicode.
Class Variable root Undocumented
Method _get_root Undocumented
Instance Variable _encoding The default unicode encoding for the fileids that make up this corpus. If encoding is None, then the file contents are processed using byte strings.
Instance Variable _fileids A list of the relative paths for the fileids that make up this corpus.
Instance Variable _root The root directory for this corpus.
Instance Variable _tagset Undocumented
def __init__(self, root, omw_reader): (source)

Construct a new wordnet corpus reader, with the given root directory.

def all_lemma_names(self, pos=None, lang='eng'): (source)

Return all lemma names for all synsets for the given part of speech tag and language or languages. If pos is not specified, all synsets for all parts of speech will be used.

def all_synsets(self, pos=None): (source)

Iterate over all synsets with a given part of speech tag. If no pos is specified, all synsets for all parts of speech will be loaded.

def citation(self, lang='omw'): (source)

Return the contents of the citation.bib file (for OMW); use lang=lang to get the citation for an individual language.

def custom_lemmas(self, tab_file, lang): (source)

Reads a custom tab file containing mappings of lemmas in the given language to Princeton WordNet 3.0 synset offsets, allowing NLTK's WordNet functions to then be used with that language.

See the "Tab files" section at http://compling.hss.ntu.edu.sg/omw/ for documentation on the Multilingual WordNet tab file format.

Parameters
tab_file	Tab file as a file or file-like object
lang	ISO 639-3 code of the language of the tab file
def get_version(self): (source)

Undocumented

def ic(self, corpus, weight_senses_equally=False, smoothing=1.0): (source)

Creates an information content lookup dictionary from a corpus.

Parameters
corpus:CorpusReader	The corpus from which we create an information content dictionary
weight_senses_equally:bool	If this is True, gives all possible senses equal weight rather than dividing by the number of possible senses. (If a word has 3 senses, each sense gets 0.3333 per appearance when this is False, 1.0 when it is True.)
smoothing:float	How much do we smooth synset counts (default is 1.0)
Returns	An information content dictionary
def jcn_similarity(self, synset1, synset2, ic, verbose=False): (source)

Undocumented

def langs(self): (source)

Return a list of languages supported by the Multilingual Wordnet.

def lch_similarity(self, synset1, synset2, verbose=False, simulate_root=True): (source)

Undocumented

def lemma(self, name, lang='eng'): (source)

Return the Lemma object that matches the name.

def lemma_count(self, lemma): (source)

Return the frequency count for this Lemma

def lemma_from_key(self, key): (source)

Undocumented

def lemmas(self, lemma, pos=None, lang='eng'): (source)

Return all Lemma objects with a name matching the specified lemma name and part of speech tag. Matches any part of speech tag if none is specified.

def license(self, lang='eng'): (source)

Return the contents of LICENSE (for OMW); use lang=lang to get the license for an individual language.

def lin_similarity(self, synset1, synset2, ic, verbose=False): (source)

Undocumented

def morphy(self, form, pos=None, check_exceptions=True): (source)

Find a possible base form for the given form, with the given part of speech, by checking WordNet's list of exceptional forms, and by recursively stripping affixes for this part of speech until a form in WordNet is found.

>>> from nltk.corpus import wordnet as wn
>>> print(wn.morphy('dogs'))
dog
>>> print(wn.morphy('churches'))
church
>>> print(wn.morphy('aardwolves'))
aardwolf
>>> print(wn.morphy('abaci'))
abacus
>>> wn.morphy('hardrock', wn.ADV)
>>> print(wn.morphy('book', wn.NOUN))
book
>>> wn.morphy('book', wn.ADJ)
def of2ss(self, of): (source)

Take a WordNet offset ID and return the corresponding synset.

def path_similarity(self, synset1, synset2, verbose=False, simulate_root=True): (source)

Undocumented

def readme(self, lang='omw'): (source)

Return the contents of README (for OMW); use lang=lang to get the readme for an individual language.

def res_similarity(self, synset1, synset2, ic, verbose=False): (source)

Undocumented

def ss2of(self, ss, lang=None): (source)

Return the WordNet ID of the given synset.

def synset(self, name): (source)

Undocumented

def synset_from_pos_and_offset(self, pos, offset): (source)

  • pos: The synset's part of speech, matching one of the module level attributes ADJ, ADJ_SAT, ADV, NOUN or VERB ('a', 's', 'r', 'n', or 'v').
  • offset: The byte offset of this synset in the WordNet dict file for this pos.
>>> from nltk.corpus import wordnet as wn
>>> print(wn.synset_from_pos_and_offset('n', 1740))
Synset('entity.n.01')

def synset_from_sense_key(self, sense_key): (source)

Retrieve the synset for a given sense_key. Sense keys can be obtained from lemma.key().

From https://wordnet.princeton.edu/documentation/senseidx5wn: A sense_key is represented as:

lemma%lex_sense (e.g. 'dog%1:18:01::')

where lex_sense is encoded as:

ss_type:lex_filenum:lex_id:head_word:head_id

lemma: ASCII text of the word or collocation, in lower case
ss_type: synset type for the sense (1 digit int). The synset type is encoded as follows:
  1	NOUN
  2	VERB
  3	ADJECTIVE
  4	ADVERB
  5	ADJECTIVE SATELLITE
lex_filenum: name of the lexicographer file containing the synset for the sense (2 digit int)
lex_id: when paired with lemma, uniquely identifies a sense in the lexicographer file (2 digit int)
head_word: lemma of the first word in the satellite's head synset; only used if the sense is in an adjective satellite synset
head_id: uniquely identifies the sense in a lexicographer file when paired with head_word (2 digit int); only used if head_word is present

>>> import nltk
>>> from nltk.corpus import wordnet as wn
>>> print(wn.synset_from_sense_key("drive%1:04:03::"))
Synset('drive.n.06')
>>> print(wn.synset_from_sense_key("driving%1:04:03::"))
Synset('drive.n.06')
def synsets(self, lemma, pos=None, lang='eng', check_exceptions=True): (source)

Load all synsets with a given lemma and part of speech tag. If no pos is specified, all synsets for all parts of speech will be loaded. If lang is specified, all the synsets associated with the lemma name of that language will be returned.

def words(self, lang='eng'): (source)

Return the lemmas of the given language as a list of words.

def wup_similarity(self, synset1, synset2, verbose=False, simulate_root=True): (source)

Undocumented

MORPHOLOGICAL_SUBSTITUTIONS = (source)

Undocumented

Value
{NOUN: [('s', ''),
        ('ses', 's'),
        ('ves', 'f'),
        ('xes', 'x'),
        ('zes', 'z'),
        ('ches', 'ch'),
        ('shes', 'sh'),
...

ADJ = (source)

Undocumented

ADJ_SAT = (source)

Undocumented

ADV = (source)

Undocumented

NOUN = (source)

Undocumented

VERB = (source)

Undocumented

def _compute_max_depth(self, pos, simulate_root): (source)

Compute the max depth for the given part of speech. This is used by the lch similarity metric.

def _data_file(self, pos): (source)

Return an open file pointer for the data file for the given part of speech.

def _load_exception_map(self): (source)

Undocumented

def _load_lang_data(self, lang): (source)

Load the WordNet data of the requested language from file into the cache _lang_data.

def _load_lemma_pos_offset_map(self): (source)

Undocumented

def _morphy(self, form, pos, check_exceptions=True): (source)

Undocumented

def _synset_from_pos_and_line(self, pos, data_file_line): (source)

Undocumented

@deprecated('Use public method synset_from_pos_and_offset() instead')
def _synset_from_pos_and_offset(self, *args, **kwargs): (source)

Hack to help people like the readers of http://stackoverflow.com/a/27145655/1709587 who were using this function before it was officially a public method

_ENCODING: str = (source)

Undocumented

Value
'utf8'
_FILEMAP = (source)

Undocumented

Value
{ADJ: 'adj', ADV: 'adv', NOUN: 'noun', VERB: 'verb'}
_FILES: tuple[str, ...] = (source)

Undocumented

Value
('cntlist.rev',
 'lexnames',
 'index.sense',
 'index.adj',
 'index.adv',
 'index.noun',
 'index.verb',
...
_pos_names = (source)

Undocumented

_pos_numbers = (source)

Undocumented

_data_file_map: dict = (source)

Undocumented

_exception_map: dict = (source)

Undocumented

_key_count_file = (source)

Undocumented

_key_synset_file = (source)

Undocumented

_lang_data = (source)

Undocumented

_lemma_pos_offset_map = (source)

Undocumented

_lexnames: list = (source)

Undocumented

_max_depth = (source)

Undocumented

_omw_reader = (source)

Undocumented

_synset_offset_cache = (source)

Undocumented