nltk.data

module documentation

Functions to find and load NLTK resource files, such as corpora, grammars, and saved processing objects. Resource files are identified using URLs, such as nltk:corpora/abc/rural.txt or http://nltk.org/sample/toy.cfg. The following URL protocols are supported:

file:path: Specifies the file whose path is path. Both relative and absolute paths may be used.

http://host/path: Specifies the file stored on the web server host at path path.

nltk:path: Specifies the file stored in the NLTK data package at path. NLTK will search for these files in the directories specified by nltk.data.path.

If no protocol is specified, then the default protocol nltk: will be used.
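
The default-protocol rule can be illustrated with a minimal standard-library sketch. The helper name `with_default_protocol` and the `KNOWN_PROTOCOLS` tuple are assumptions made for this illustration; they are not part of `nltk.data`, which exposes `normalize_resource_url` for the real behaviour.

```python
# Protocols named in this documentation. Treating this tuple as the complete
# set is an assumption made for the sketch.
KNOWN_PROTOCOLS = ("file", "http", "nltk")


def with_default_protocol(resource_url):
    """Return resource_url unchanged if it names a known protocol,
    otherwise prefix it with the default 'nltk:' protocol."""
    protocol, sep, _rest = resource_url.partition(":")
    if sep and protocol in KNOWN_PROTOCOLS:
        return resource_url
    return "nltk:" + resource_url


print(with_default_protocol("corpora/abc/rural.txt"))
print(with_default_protocol("http://nltk.org/sample/toy.cfg"))
```

The sketch ignores platform-specific cases such as Windows drive letters, which the doctests for `normalize_resource_url` cover.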

This module provides two functions that can be used to access a resource file, given its URL: load() loads a given resource and adds it to a resource cache, and retrieve() copies a given resource to a local file.

Class	`BufferedGzipFile`	A `GzipFile` subclass for compatibility with older nltk releases.
Class	`FileSystemPathPointer`	A path pointer that identifies a file which can be accessed directly via a given absolute path.
Class	`GzipFileSystemPathPointer`	A subclass of `FileSystemPathPointer` that identifies a gzip-compressed file located at a given absolute path. `GzipFileSystemPathPointer` is appropriate for loading large gzip-compressed pickle objects efficiently.
Class	`LazyLoader`	Undocumented
Class	`OpenOnDemandZipFile`	A subclass of `zipfile.ZipFile` that closes its file pointer whenever it is not using it; and re-opens it when it needs to read data from the zipfile. This is useful for reducing the number of open file handles when many zip files are being accessed at once...
Class	`PathPointer`	An abstract base class for 'path pointers,' used by NLTK's data package to identify specific paths. Two subclasses exist: `FileSystemPathPointer` identifies a file that can be accessed directly via a given absolute path...
Class	`SeekableUnicodeStreamReader`	A stream reader that automatically decodes the source byte stream into unicode (like `codecs.StreamReader`); but still supports the `seek()` and `tell()` operations correctly. This is in contrast to `codecs.StreamReader`...
Class	`ZipFilePathPointer`	A path pointer that identifies a file contained within a zipfile, which can be accessed by reading that zipfile.
Function	`clear_cache`	Remove all objects from the resource cache. See load().
Function	`find`	Find the given resource by searching through the directories and zip files in paths, where a None or empty string specifies an absolute path. Returns a corresponding path name. If the given resource is not found, raise a ...
Function	`gzip_open_unicode`	Undocumented
Function	`load`	Load a given resource from the NLTK data package. The following resource formats are currently supported:
Function	`normalize_resource_name`	No summary
Function	`normalize_resource_url`	Normalizes a resource URL.
Function	`retrieve`	Copy the given resource to a local file. If no filename is specified, then use the URL's filename. If there is already a file named `filename`, then raise a `ValueError`.
Function	`show_cfg`	Write out a grammar file, ignoring escaped and empty lines.
Function	`split_resource_url`	Splits a resource URL of the form "<protocol>:<path>" into a (protocol, path) tuple.
Constant	`AUTO_FORMATS`	Undocumented
Constant	`FORMATS`	Undocumented
Variable	`path`	A list of directories where the NLTK data package might reside. These directories will be checked in order when looking for a resource in the data package. Note that this allows users to substitute in their own versions of resources, if they have them (e...
Variable	`textwrap_indent`	Undocumented
Function	`_open`	Helper function that returns an open file object for a resource, given its resource URL. If the given resource URL uses the "nltk:" protocol, or uses no protocol, then use `nltk.data.find` to find its path, and open it with the given mode; if the resource URL uses the 'file' protocol, then open the file with the given mode; otherwise, delegate to ...
Variable	`_paths_from_env`	Undocumented
Variable	`_resource_cache`	A dictionary used to cache resources so that they won't need to be loaded more than once.

def clear_cache():

Remove all objects from the resource cache. See load().

def find(resource_name, paths=None):

Find the given resource by searching through the directories and zip files in paths, where a None or empty string specifies an absolute path. Returns a corresponding path name. If the given resource is not found, raise a LookupError, whose message gives a pointer to the installation instructions for the NLTK downloader.

Zip File Handling:

If resource_name contains a component with a .zip extension, then it is assumed to be a zipfile; and the remaining path components are used to look inside the zipfile.

If any element of nltk.data.path has a .zip extension, then it is assumed to be a zipfile.

If a given resource name that does not contain any zipfile component is not found initially, then find() will make a second attempt to find that resource, by replacing each component p in the path with p.zip/p. For example, this allows find() to map the resource name corpora/chat80/cities.pl to a zip file path pointer to corpora/chat80.zip/chat80/cities.pl.

When using find() to locate a directory contained in a zipfile, the resource name must end with the forward slash character. Otherwise, find() will not locate the directory.
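
The second-attempt rule (replace each component p with p.zip/p) can be sketched with plain string handling. The helper name `zip_candidates` is hypothetical, and the sketch only generates candidate names; it does not open any zip file or reproduce the rest of find().

```python
def zip_candidates(resource_name):
    """List the names obtained by replacing, one at a time, each path
    component p of a posix-style resource name with p.zip/p."""
    pieces = resource_name.split("/")
    return [
        "/".join(pieces[:i] + [pieces[i] + ".zip"] + pieces[i:])
        for i in range(len(pieces))
    ]


for candidate in zip_candidates("corpora/chat80/cities.pl"):
    print(candidate)
# corpora.zip/corpora/chat80/cities.pl
# corpora/chat80.zip/chat80/cities.pl
# corpora/chat80/cities.pl.zip/cities.pl
```

The second candidate is the mapping given as the example in the text above.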

Parameters
resource_name:str or unicode	The name of the resource to search for. Resource names are posix-style relative path names, such as `corpora/brown`. Directory names will be automatically converted to a platform-appropriate path separator.
paths	Undocumented
Returns
str	Undocumented

def gzip_open_unicode(filename, mode='rb', compresslevel=9, encoding='utf-8', fileobj=None, errors=None, newline=None):

Undocumented

def load(resource_url, format='auto', cache=True, verbose=False, logic_parser=None, fstruct_reader=None, encoding=None):

Load a given resource from the NLTK data package. The following resource formats are currently supported:

pickle

json

yaml

cfg (context free grammars)

pcfg (probabilistic CFGs)

fcfg (feature-based CFGs)

fol (formulas of First Order Logic)

logic (Logical formulas to be parsed by the given logic_parser)

val (valuation of First Order Logic model)

text (the file contents as a unicode string)

raw (the raw file contents as a byte string)

If no format is specified, load() will attempt to determine a format based on the resource name's file extension. If that fails, load() will raise a ValueError exception.
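
Extension-based format detection might look like the following sketch. The table below copies only the entries of AUTO_FORMATS that are visible on this page, and it assumes the keys are file extensions; the function name `guess_format` is hypothetical.

```python
import os

# Partial extension-to-format table. Only the AUTO_FORMATS entries visible in
# this documentation are listed; the real table is longer (assumption: its
# keys are file extensions without the leading dot).
EXTENSION_FORMATS = {
    "pickle": "pickle",
    "json": "json",
    "yaml": "yaml",
    "cfg": "cfg",
    "pcfg": "pcfg",
    "fcfg": "fcfg",
    "fol": "fol",
}


def guess_format(resource_url):
    """Pick a format from the file extension, or raise ValueError."""
    _root, ext = os.path.splitext(resource_url)
    fmt = EXTENSION_FORMATS.get(ext.lstrip("."))
    if fmt is None:
        raise ValueError(
            "Could not determine format for %r based on its file extension"
            % resource_url
        )
    return fmt


print(guess_format("http://nltk.org/sample/toy.cfg"))  # cfg
```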

For all text formats (everything except pickle, json, yaml and raw), load() tries to decode the raw contents using UTF-8 and, if that fails, falls back to ISO-8859-1 (Latin-1), unless an encoding is specified.
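
The decoding fallback can be sketched in a few lines of standard-library Python. The function name `decode_text` is hypothetical; this is an illustration of the described order of attempts, not the code used inside load().

```python
def decode_text(raw, encoding=None):
    """Decode bytes with the given encoding, or try UTF-8 and then
    fall back to ISO-8859-1 when no encoding is given."""
    if encoding is not None:
        return raw.decode(encoding)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a code point, so this cannot fail.
        return raw.decode("iso-8859-1")


print(decode_text("café".encode("utf-8")))  # valid UTF-8
print(decode_text(b"caf\xe9"))              # invalid UTF-8, decoded as Latin-1
```

Because Latin-1 accepts any byte sequence, the fallback never raises, but it can silently produce the wrong characters for data in some other encoding; passing an explicit encoding avoids that.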

Parameters
resource_url:str	A URL specifying where the resource should be loaded from. The default protocol is "nltk:", which searches for the file in the NLTK data package.
format	Undocumented
cache:bool	If true, add this resource to a cache. If load() finds a resource in its cache, then it will return it from the cache rather than loading it.
verbose:bool	If true, print a message when loading a resource. Messages are not displayed when a resource is retrieved from the cache.
logic_parser:LogicParser	The parser that will be used to parse logical expressions.
fstruct_reader:FeatStructReader	The parser that will be used to parse the feature structure of an fcfg.
encoding:str	The encoding of the input; only used for text formats.
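
The effect of the cache parameter can be illustrated with a small dictionary-backed sketch. The names `load_cached` and `clear_cache_sketch`, the `loader` callable, and the choice of (URL, format) as the cache key are all assumptions made for the illustration.

```python
# Stand-in for the module-level resource cache (a plain dict).
_cache = {}


def load_cached(resource_url, loader, format="auto", cache=True):
    """Return a cached value when allowed, otherwise call loader(resource_url).

    loader is a hypothetical callable that does the actual loading.
    """
    key = (resource_url, format)  # assumed cache key
    if cache and key in _cache:
        return _cache[key]
    value = loader(resource_url)
    if cache:
        _cache[key] = value
    return value


def clear_cache_sketch():
    """Remove all objects from the sketch's cache."""
    _cache.clear()


calls = []
grammar = load_cached("grammars/toy.cfg", lambda url: calls.append(url) or object())
same = load_cached("grammars/toy.cfg", lambda url: calls.append(url) or object())
print(grammar is same, len(calls))  # True 1
```

Returning the cached object means callers share one instance, so mutating a loaded resource affects later callers until the cache is cleared.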

def normalize_resource_name(resource_name, allow_relative=True, relative_path=None):

>>> import sys
>>> from nltk.data import normalize_resource_name
>>> windows = sys.platform.startswith('win')
>>> normalize_resource_name('.', True)
'./'
>>> normalize_resource_name('./', True)
'./'
>>> windows or normalize_resource_name('dir/file', False, '/') == '/dir/file'
True
>>> not windows or normalize_resource_name('C:/file', False, '/') == '/C:/file'
True
>>> windows or normalize_resource_name('/dir/file', False, '/') == '/dir/file'
True
>>> windows or normalize_resource_name('../dir/file', False, '/') == '/dir/file'
True
>>> not windows or normalize_resource_name('/dir/file', True, '/') == 'dir/file'
True
>>> windows or normalize_resource_name('/dir/file', True, '/') == '/dir/file'
True

Parameters
resource_name:str or unicode	The name of the resource to search for. Resource names are posix-style relative path names, such as `corpora/brown`. Directory names will automatically be converted to a platform-appropriate path separator. Directory trailing slashes are preserved.
allow_relative	Undocumented
relative_path	Undocumented

def normalize_resource_url(resource_url):

Normalizes a resource URL.

>>> import os, sys
>>> from nltk.data import normalize_resource_url, split_resource_url
>>> windows = sys.platform.startswith('win')
>>> os.path.normpath(split_resource_url(normalize_resource_url('file:grammar.fcfg'))[1]) == \
... ('\\' if windows else '') + os.path.abspath(os.path.join(os.curdir, 'grammar.fcfg'))
True
>>> not windows or normalize_resource_url('file:C:/dir/file') == 'file:///C:/dir/file'
True
>>> not windows or normalize_resource_url('file:C:\\dir\\file') == 'file:///C:/dir/file'
True
>>> not windows or normalize_resource_url('file:C:\\dir/file') == 'file:///C:/dir/file'
True
>>> not windows or normalize_resource_url('file://C:/dir/file') == 'file:///C:/dir/file'
True
>>> not windows or normalize_resource_url('file:////C:/dir/file') == 'file:///C:/dir/file'
True
>>> not windows or normalize_resource_url('nltk:C:/dir/file') == 'file:///C:/dir/file'
True
>>> not windows or normalize_resource_url('nltk:C:\\dir\\file') == 'file:///C:/dir/file'
True
>>> windows or normalize_resource_url('file:/dir/file/toy.cfg') == 'file:///dir/file/toy.cfg'
True
>>> normalize_resource_url('nltk:home/nltk')
'nltk:home/nltk'
>>> windows or normalize_resource_url('nltk:/home/nltk') == 'file:///home/nltk'
True
>>> normalize_resource_url('http://example.com/dir/file')
'http://example.com/dir/file'
>>> normalize_resource_url('dir/file')
'nltk:dir/file'

def retrieve(resource_url, filename=None, verbose=True):

Copy the given resource to a local file. If no filename is specified, then use the URL's filename. If there is already a file named filename, then raise a ValueError.

Parameters
resource_url:str	A URL specifying where the resource should be loaded from. The default protocol is "nltk:", which searches for the file in the NLTK data package.
filename	Undocumented
verbose	Undocumented
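
The two rules described for retrieve() (default the filename to the URL's last component, and refuse to overwrite an existing file) can be sketched with the standard library. The helpers `default_filename` and `copy_resource` are hypothetical, and the resource is represented by an already-open binary stream rather than by a real NLTK lookup.

```python
import io
import os
import shutil
import tempfile


def default_filename(resource_url):
    """Last path component of the URL, with any 'protocol:' prefix dropped."""
    without_protocol = resource_url.split(":", 1)[-1]
    return without_protocol.rstrip("/").rsplit("/", 1)[-1]


def copy_resource(stream, filename):
    """Copy a binary stream to filename, refusing to overwrite a file."""
    if os.path.exists(filename):
        raise ValueError("File %r already exists!" % filename)
    with open(filename, "wb") as outfile:
        shutil.copyfileobj(stream, outfile)
    return filename


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, default_filename("http://nltk.org/sample/toy.cfg"))
    copy_resource(io.BytesIO(b"S -> NP VP\n"), target)
    try:
        copy_resource(io.BytesIO(b"S -> NP VP\n"), target)
    except ValueError as err:
        print("second copy refused:", type(err).__name__)
```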

def show_cfg(resource_url, escape='##'):

Write out a grammar file, ignoring escaped and empty lines.

Parameters
resource_url:str	A URL specifying where the resource should be loaded from. The default protocol is "nltk:", which searches for the file in the NLTK data package.
escape:str	Prepended string that signals lines to be ignored.
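
The filtering that show_cfg() describes can be sketched on an in-memory string. The helper name `visible_lines` is hypothetical, and the sketch returns the kept lines instead of writing them out.

```python
def visible_lines(text, escape="##"):
    """Keep lines that are neither empty nor prefixed with escape."""
    return [
        line
        for line in text.splitlines()
        if line and not line.startswith(escape)
    ]


grammar_text = "## A toy grammar\nS -> NP VP\n\nNP -> 'I'\n"
for line in visible_lines(grammar_text):
    print(line)
# S -> NP VP
# NP -> 'I'
```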

def split_resource_url(resource_url):

Splits a resource URL of the form "<protocol>:<path>" into a (protocol, path) tuple.

>>> import sys
>>> from nltk.data import split_resource_url
>>> windows = sys.platform.startswith('win')
>>> split_resource_url('nltk:home/nltk')
('nltk', 'home/nltk')
>>> split_resource_url('nltk:/home/nltk')
('nltk', '/home/nltk')
>>> split_resource_url('file:/home/nltk')
('file', '/home/nltk')
>>> split_resource_url('file:///home/nltk')
('file', '/home/nltk')
>>> split_resource_url('file:///C:/home/nltk')
('file', '/C:/home/nltk')

AUTO_FORMATS: dict[str, str]

Undocumented

Value

{'pickle': 'pickle',
 'json': 'json',
 'yaml': 'yaml',
 'cfg': 'cfg',
 'pcfg': 'pcfg',
 'fcfg': 'fcfg',
 'fol': 'fol',
...

FORMATS: dict[str, str]

Undocumented

Value

{'pickle': 'A serialized python object, stored using the pickle module.',
 'json': 'A serialized python object, stored using the json module.',
 'yaml': 'A serialized python object, stored using the yaml module.',
 'cfg': 'A context free grammar.',
 'pcfg': 'A probabilistic CFG.',
 'fcfg': 'A feature CFG.',
 'fol': 'A list of first order logic expressions, parsed with nltk.sem.logic.Exp↵
...

path

A list of directories where the NLTK data package might reside. These directories will be checked in order when looking for a resource in the data package. Note that this allows users to substitute in their own versions of resources, if they have them (e.g., in their home directory under ~/nltk_data).
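
The in-order search, and the way it lets a user's copy shadow a shared one, can be sketched with temporary directories. The function `find_first` is hypothetical and checks plain directories only; the real lookup also handles zip files.

```python
import os
import tempfile


def find_first(resource_name, search_dirs):
    """Return the first existing match for a posix-style resource name,
    checking search_dirs in order; raise LookupError if none matches."""
    for directory in search_dirs:
        candidate = os.path.join(directory, *resource_name.split("/"))
        if os.path.exists(candidate):
            return candidate
    raise LookupError("Resource %r not found in %r" % (resource_name, search_dirs))


with tempfile.TemporaryDirectory() as user_dir, tempfile.TemporaryDirectory() as shared_dir:
    # Put the same resource name in both directories.
    for directory in (user_dir, shared_dir):
        os.makedirs(os.path.join(directory, "grammars"))
        with open(os.path.join(directory, "grammars", "toy.cfg"), "w") as handle:
            handle.write("S -> NP VP\n")
    found = find_first("grammars/toy.cfg", [user_dir, shared_dir])
    print(found.startswith(user_dir))  # True: the earlier directory wins
```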

textwrap_indent

Undocumented

def _open(resource_url):

Helper function that returns an open file object for a resource, given its resource URL. If the given resource URL uses the "nltk:" protocol, or uses no protocol, then use nltk.data.find to find its path, and open it with the given mode; if the resource URL uses the 'file' protocol, then open the file with the given mode; otherwise, delegate to urllib.request.urlopen (named urllib2.urlopen in Python 2).

Parameters
resource_url:str	A URL specifying where the resource should be loaded from. The default protocol is "nltk:", which searches for the file in the NLTK data package.
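
The protocol dispatch described for this helper can be sketched as follows. The function `open_resource` and its `find_path` argument are hypothetical: `find_path` stands in for nltk.data.find and is assumed here to return a plain filesystem path, which simplifies what the real helper works with.

```python
import os
import tempfile
from urllib.request import urlopen


def open_resource(resource_url, find_path):
    """Open a resource in binary mode, dispatching on the URL protocol."""
    protocol, sep, path = resource_url.partition(":")
    if not sep:
        # No protocol given: fall back to the default 'nltk:' protocol.
        protocol, path = "nltk", resource_url
    if protocol == "nltk":
        return open(find_path(path), "rb")
    if protocol == "file":
        return open(path, "rb")
    # Any other protocol (for example http) is delegated to urlopen.
    return urlopen(resource_url)


with tempfile.TemporaryDirectory() as data_dir:
    with open(os.path.join(data_dir, "toy.cfg"), "wb") as handle:
        handle.write(b"S -> NP VP\n")
    with open_resource("toy.cfg", lambda name: os.path.join(data_dir, name)) as stream:
        print(stream.read())  # b'S -> NP VP\n'
```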

_paths_from_env

Undocumented

_resource_cache: dict

A dictionary used to cache resources so that they won't need to be loaded more than once.