module documentation

Functions to find and load NLTK resource files, such as corpora, grammars, and saved processing objects. Resource files are identified using URLs, such as nltk:corpora/abc/rural.txt or http://nltk.org/sample/toy.cfg. The following URL protocols are supported:

  • file:path: Specifies the file whose path is path. Both relative and absolute paths may be used.
  • http://host/path: Specifies the file stored on the web server host at path path.
  • nltk:path: Specifies the file stored in the NLTK data package at path. NLTK will search for these files in the directories specified by nltk.data.path.

If no protocol is specified, then the default protocol nltk: will be used.
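The default-protocol rule can be illustrated with a small stand-alone sketch. The helper name `split_protocol` is hypothetical and is not part of NLTK; it deliberately ignores edge cases such as Windows drive letters and leading slashes, which the module's own `split_resource_url()` and `normalize_resource_url()` handle.

```python
def split_protocol(resource_url, default="nltk"):
    """Hypothetical helper: split a resource URL into (protocol, path).

    Assumption: anything before the first colon is the protocol.
    """
    protocol, sep, path = resource_url.partition(":")
    if not sep:
        # No colon at all, so no protocol was given: use the default.
        return default, resource_url
    return protocol, path


print(split_protocol("nltk:corpora/abc/rural.txt"))  # ('nltk', 'corpora/abc/rural.txt')
print(split_protocol("corpora/abc/rural.txt"))       # ('nltk', 'corpora/abc/rural.txt')
```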

This module provides two functions that can be used to access a resource file, given its URL: load() loads a given resource and adds it to a resource cache; retrieve() copies a given resource to a local file.

Class BufferedGzipFile A GzipFile subclass for compatibility with older nltk releases.
Class FileSystemPathPointer A path pointer that identifies a file which can be accessed directly via a given absolute path.
Class GzipFileSystemPathPointer A subclass of FileSystemPathPointer that identifies a gzip-compressed file located at a given absolute path. GzipFileSystemPathPointer is appropriate for loading large gzip-compressed pickle objects efficiently.
Class LazyLoader Undocumented
Class OpenOnDemandZipFile A subclass of zipfile.ZipFile that closes its file pointer whenever it is not using it; and re-opens it when it needs to read data from the zipfile. This is useful for reducing the number of open file handles when many zip files are being accessed at once...
Class PathPointer An abstract base class for 'path pointers,' used by NLTK's data package to identify specific paths. Two subclasses exist: FileSystemPathPointer identifies a file that can be accessed directly via a given absolute path...
Class SeekableUnicodeStreamReader A stream reader that automatically decodes the source byte stream into unicode (like codecs.StreamReader); but still supports the seek() and tell() operations correctly. This is in contrast to codecs.StreamReader...
Class ZipFilePathPointer A path pointer that identifies a file contained within a zipfile, which can be accessed by reading that zipfile.
Function clear_cache Remove all objects from the resource cache. See load().
Function find Find the given resource by searching through the directories and zip files in paths, where a None or empty string specifies an absolute path. Returns a corresponding path name. If the given resource is not found, raise a ...
Function gzip_open_unicode Undocumented
Function load Load a given resource from the NLTK data package. The following resource formats are currently supported:
Function normalize_resource_name No summary
Function normalize_resource_url Normalizes a resource url
Function retrieve Copy the given resource to a local file. If no filename is specified, then use the URL's filename. If there is already a file named filename, then raise a ValueError.
Function show_cfg Write out a grammar file, ignoring escaped and empty lines.
Function split_resource_url Splits a resource url into "<protocol>:<path>".
Constant AUTO_FORMATS Undocumented
Constant FORMATS Undocumented
Variable path A list of directories where the NLTK data package might reside. These directories will be checked in order when looking for a resource in the data package. Note that this allows users to substitute in their own versions of resources, if they have them (e...
Variable textwrap_indent Undocumented
Function _open Helper function that returns an open file object for a resource, given its resource URL. If the given resource URL uses the "nltk:" protocol, or uses no protocol, then use nltk.data.find to find its path, and open it with the given mode; if the resource URL uses the 'file' protocol, then open the file with the given mode; otherwise, delegate to ...
Variable _paths_from_env Undocumented
Variable _resource_cache A dictionary used to cache resources so that they won't need to be loaded more than once.
def clear_cache():

Remove all objects from the resource cache. See load().

def find(resource_name, paths=None):

Find the given resource by searching through the directories and zip files in paths, where a None or empty string specifies an absolute path. Returns a corresponding path name. If the given resource is not found, raise a LookupError, whose message gives a pointer to the installation instructions for the NLTK downloader.

Zip File Handling:

  • If resource_name contains a component with a .zip extension, then it is assumed to be a zipfile; and the remaining path components are used to look inside the zipfile.
  • If any element of nltk.data.path has a .zip extension, then it is assumed to be a zipfile.
  • If a given resource name that does not contain any zipfile component is not found initially, then find() will make a second attempt to find that resource, by replacing each component p in the path with p.zip/p. For example, this allows find() to map the resource name corpora/chat80/cities.pl to a zip file path pointer to corpora/chat80.zip/chat80/cities.pl.
  • When using find() to locate a directory contained in a zipfile, the resource name must end with the forward slash character. Otherwise, find() will not locate the directory.
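The second-attempt rule (replacing a component p with p.zip/p) can be sketched as a function that lists the candidate zip paths for a resource name. The name `zip_candidates` is hypothetical, and this is only an illustration of the rule as described above, not NLTK's implementation.

```python
def zip_candidates(resource_name):
    """Hypothetical helper: list the zip-based paths to try for a resource.

    For each component p of the posix-style name, build one candidate in
    which p is replaced by p.zip/p.
    """
    pieces = resource_name.split("/")
    return [
        "/".join(pieces[:i] + [pieces[i] + ".zip"] + pieces[i:])
        for i in range(len(pieces))
    ]


for candidate in zip_candidates("corpora/chat80/cities.pl"):
    print(candidate)
# corpora.zip/corpora/chat80/cities.pl
# corpora/chat80.zip/chat80/cities.pl
# corpora/chat80/cities.pl.zip/cities.pl
```

The second candidate is the mapping given in the example above; the others simply would not exist on disk in a typical data directory.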
Parameters
  resource_name (str or unicode): The name of the resource to search for. Resource names are posix-style relative path names, such as corpora/brown. Directory names will be automatically converted to a platform-appropriate path separator.
  paths: Undocumented
Returns
  str: Undocumented
def gzip_open_unicode(filename, mode='rb', compresslevel=9, encoding='utf-8', fileobj=None, errors=None, newline=None):

Undocumented

def load(resource_url, format='auto', cache=True, verbose=False, logic_parser=None, fstruct_reader=None, encoding=None):

Load a given resource from the NLTK data package. The following resource formats are currently supported:

  • pickle
  • json
  • yaml
  • cfg (context free grammars)
  • pcfg (probabilistic CFGs)
  • fcfg (feature-based CFGs)
  • fol (formulas of First Order Logic)
  • logic (Logical formulas to be parsed by the given logic_parser)
  • val (valuation of First Order Logic model)
  • text (the file contents as a unicode string)
  • raw (the raw file contents as a byte string)

If no format is specified, load() will attempt to determine a format based on the resource name's file extension. If that fails, load() will raise a ValueError exception.
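As a rough sketch of extension-based format detection, the function below maps a file extension to a format name and raises ValueError when it cannot. The name `guess_format` and the table `EXTENSION_FORMATS` are hypothetical; the table lists only the entries of AUTO_FORMATS that are visible in this documentation, so it is incomplete by design.

```python
# Assumption: a partial extension-to-format table, limited to the
# AUTO_FORMATS entries shown in this documentation.
EXTENSION_FORMATS = {
    "pickle": "pickle",
    "json": "json",
    "yaml": "yaml",
    "cfg": "cfg",
    "pcfg": "pcfg",
    "fcfg": "fcfg",
    "fol": "fol",
}


def guess_format(resource_name):
    """Hypothetical helper: infer a format from the file extension."""
    _, dot, ext = resource_name.rpartition(".")
    if not dot or ext not in EXTENSION_FORMATS:
        raise ValueError(f"Could not determine format for {resource_name!r}")
    return EXTENSION_FORMATS[ext]


print(guess_format("grammars/sample_grammars/toy.cfg"))  # cfg
```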

For all text formats (everything except pickle, json, yaml and raw), it tries to decode the raw contents using UTF-8, and if that doesn't work, it tries with ISO-8859-1 (Latin-1), unless the encoding is specified.
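The decoding fallback described above can be sketched with the standard library alone. The function name `decode_text` is hypothetical; it mirrors the described behaviour (an explicit encoding wins, otherwise UTF-8 and then Latin-1) without claiming to be NLTK's code.

```python
def decode_text(raw, encoding=None):
    """Hypothetical helper: decode bytes the way the paragraph above describes."""
    if encoding is not None:
        # An explicit encoding disables the fallback entirely.
        return raw.decode(encoding)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this cannot fail.
        return raw.decode("iso-8859-1")


print(decode_text("café".encode("utf-8")))       # café
print(decode_text("café".encode("iso-8859-1")))  # café
```

Because Latin-1 accepts any byte sequence, the fallback never raises, but it can silently produce wrong characters for files in other encodings; passing `encoding` explicitly avoids that.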

Parameters
  resource_url (str): A URL specifying where the resource should be loaded from. The default protocol is "nltk:", which searches for the file in the NLTK data package.
  format: Undocumented
  cache (bool): If true, add this resource to a cache. If load() finds a resource in its cache, then it will return it from the cache rather than loading it.
  verbose (bool): If true, print a message when loading a resource. Messages are not displayed when a resource is retrieved from the cache.
  logic_parser (LogicParser): The parser that will be used to parse logical expressions.
  fstruct_reader (FeatStructReader): The parser that will be used to parse the feature structure of an fcfg.
  encoding (str): The encoding of the input; only used for text formats.
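The effect of the cache parameter can be shown with a minimal sketch built around a plain dictionary. The names `load_with_cache`, `clear_sketch_cache`, and the `loader` callback are hypothetical, and keying the cache by URL alone is a simplifying assumption.

```python
_sketch_cache = {}  # hypothetical stand-in for a resource cache


def load_with_cache(resource_url, loader, cache=True):
    """Hypothetical helper: return a cached value, or load and store it."""
    if cache and resource_url in _sketch_cache:
        return _sketch_cache[resource_url]
    value = loader(resource_url)  # the expensive step
    if cache:
        _sketch_cache[resource_url] = value
    return value


def clear_sketch_cache():
    """Drop every cached value so the next load runs the loader again."""
    _sketch_cache.clear()


calls = []
loader = lambda url: calls.append(url) or url.upper()
load_with_cache("nltk:toy.cfg", loader)
load_with_cache("nltk:toy.cfg", loader)
print(len(calls))  # 1
```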
def normalize_resource_name(resource_name, allow_relative=True, relative_path=None):

>>> windows = sys.platform.startswith('win')
>>> normalize_resource_name('.', True)
'./'
>>> normalize_resource_name('./', True)
'./'
>>> windows or normalize_resource_name('dir/file', False, '/') == '/dir/file'
True
>>> not windows or normalize_resource_name('C:/file', False, '/') == '/C:/file'
True
>>> windows or normalize_resource_name('/dir/file', False, '/') == '/dir/file'
True
>>> windows or normalize_resource_name('../dir/file', False, '/') == '/dir/file'
True
>>> not windows or normalize_resource_name('/dir/file', True, '/') == 'dir/file'
True
>>> windows or normalize_resource_name('/dir/file', True, '/') == '/dir/file'
True

Parameters
  resource_name (str or unicode): The name of the resource to search for. Resource names are posix-style relative path names, such as corpora/brown. Directory names will automatically be converted to a platform-appropriate path separator. Trailing slashes on directory names are preserved.
  allow_relative: Undocumented
  relative_path: Undocumented
def normalize_resource_url(resource_url):

Normalizes a resource url

>>> windows = sys.platform.startswith('win')
>>> os.path.normpath(split_resource_url(normalize_resource_url('file:grammar.fcfg'))[1]) == \
... ('\\' if windows else '') + os.path.abspath(os.path.join(os.curdir, 'grammar.fcfg'))
True
>>> not windows or normalize_resource_url('file:C:/dir/file') == 'file:///C:/dir/file'
True
>>> not windows or normalize_resource_url('file:C:\\dir\\file') == 'file:///C:/dir/file'
True
>>> not windows or normalize_resource_url('file:C:\\dir/file') == 'file:///C:/dir/file'
True
>>> not windows or normalize_resource_url('file://C:/dir/file') == 'file:///C:/dir/file'
True
>>> not windows or normalize_resource_url('file:////C:/dir/file') == 'file:///C:/dir/file'
True
>>> not windows or normalize_resource_url('nltk:C:/dir/file') == 'file:///C:/dir/file'
True
>>> not windows or normalize_resource_url('nltk:C:\\dir\\file') == 'file:///C:/dir/file'
True
>>> windows or normalize_resource_url('file:/dir/file/toy.cfg') == 'file:///dir/file/toy.cfg'
True
>>> normalize_resource_url('nltk:home/nltk')
'nltk:home/nltk'
>>> windows or normalize_resource_url('nltk:/home/nltk') == 'file:///home/nltk'
True
>>> normalize_resource_url('http://example.com/dir/file')
'http://example.com/dir/file'
>>> normalize_resource_url('dir/file')
'nltk:dir/file'
def retrieve(resource_url, filename=None, verbose=True):

Copy the given resource to a local file. If no filename is specified, then use the URL's filename. If there is already a file named filename, then raise a ValueError.

Parameters
  resource_url (str): A URL specifying where the resource should be loaded from. The default protocol is "nltk:", which searches for the file in the NLTK data package.
  filename: Undocumented
  verbose: Undocumented
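The filename rules for retrieve() (default to the URL's filename, refuse to overwrite an existing file) can be sketched as below. The helper `target_filename` is hypothetical, and taking the last slash-separated component of the URL is an assumption about what "the URL's filename" means.

```python
import os


def target_filename(resource_url, filename=None):
    """Hypothetical helper: choose the local file name for a copy."""
    if filename is None:
        # Assumption: the URL's filename is its last path component,
        # with any leading "protocol:" prefix removed.
        filename = resource_url.rstrip("/").rsplit("/", 1)[-1]
        filename = filename.rsplit(":", 1)[-1]
    if os.path.exists(filename):
        raise ValueError(f"File {filename!r} already exists!")
    return filename
```

Called as `target_filename("http://nltk.org/sample/toy.cfg")`, this returns `toy.cfg` unless a file of that name already exists in the current directory, in which case it raises ValueError.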
def show_cfg(resource_url, escape='##'):

Write out a grammar file, ignoring escaped and empty lines.

Parameters
  resource_url (str): A URL specifying where the resource should be loaded from. The default protocol is "nltk:", which searches for the file in the NLTK data package.
  escape (str): Prepended string that signals lines to be ignored.
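The line filtering that show_cfg() describes can be sketched on an in-memory string. The function `visible_lines` is hypothetical; it returns the kept lines instead of writing them out, and it treats only truly empty lines as empty.

```python
def visible_lines(text, escape="##"):
    """Hypothetical helper: drop empty lines and lines starting with `escape`."""
    return [
        line
        for line in text.splitlines()
        if line and not line.startswith(escape)
    ]


grammar = "## A toy grammar\nS -> NP VP\n\nNP -> 'John'\n"
print(visible_lines(grammar))  # ['S -> NP VP', "NP -> 'John'"]
```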
def split_resource_url(resource_url):

Splits a resource url into "<protocol>:<path>".

>>> windows = sys.platform.startswith('win')
>>> split_resource_url('nltk:home/nltk')
('nltk', 'home/nltk')
>>> split_resource_url('nltk:/home/nltk')
('nltk', '/home/nltk')
>>> split_resource_url('file:/home/nltk')
('file', '/home/nltk')
>>> split_resource_url('file:///home/nltk')
('file', '/home/nltk')
>>> split_resource_url('file:///C:/home/nltk')
('file', '/C:/home/nltk')
AUTO_FORMATS: dict[str, str] =

Undocumented

Value
{'pickle': 'pickle',
 'json': 'json',
 'yaml': 'yaml',
 'cfg': 'cfg',
 'pcfg': 'pcfg',
 'fcfg': 'fcfg',
 'fol': 'fol',
...
FORMATS: dict[str, str] =

Undocumented

Value
{'pickle': 'A serialized python object, stored using the pickle module.',
 'json': 'A serialized python object, stored using the json module.',
 'yaml': 'A serialized python object, stored using the yaml module.',
 'cfg': 'A context free grammar.',
 'pcfg': 'A probabilistic CFG.',
 'fcfg': 'A feature CFG.',
 'fol': 'A list of first order logic expressions, parsed with nltk.sem.logic.Exp
...

path =

A list of directories where the NLTK data package might reside. These directories will be checked in order when looking for a resource in the data package. Note that this allows users to substitute in their own versions of resources, if they have them (e.g., in their home directory under ~/nltk_data).
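The in-order search is what lets a user's copy shadow a system-wide one. A minimal sketch, assuming plain directories only (no zip files) and using the hypothetical name `first_match`:

```python
import os


def first_match(resource_name, search_dirs):
    """Hypothetical helper: return the first existing path for a resource.

    `resource_name` is a posix-style relative name; directories are
    checked in the order given, so earlier entries take precedence.
    """
    for directory in search_dirs:
        candidate = os.path.join(directory, *resource_name.split("/"))
        if os.path.exists(candidate):
            return candidate
    raise LookupError(f"Resource {resource_name!r} not found")
```

With a search list such as `["/home/me/nltk_data", "/usr/share/nltk_data"]` (illustrative paths), a file present in both directories resolves to the copy in the first one.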

textwrap_indent =

Undocumented

def _open(resource_url):

Helper function that returns an open file object for a resource, given its resource URL. If the given resource URL uses the "nltk:" protocol, or uses no protocol, then use nltk.data.find to find its path, and open it with the given mode; if the resource URL uses the 'file' protocol, then open the file with the given mode; otherwise, delegate to urllib.request.urlopen (urllib2.urlopen on Python 2).

Parameters
  resource_url (str): A URL specifying where the resource should be loaded from. The default protocol is "nltk:", which searches for the file in the NLTK data package.
_paths_from_env =

Undocumented

_resource_cache: dict =

A dictionary used to cache resources so that they won't need to be loaded more than once.