Functions to find and load NLTK resource files, such as corpora, grammars, and saved processing objects. Resource files are identified using URLs, such as nltk:corpora/abc/rural.txt or http://nltk.org/sample/toy.cfg. The following URL protocols are supported:
- file:path: Specifies the file whose path is path. Both relative and absolute paths may be used.
- http://host/path: Specifies the file stored at path on the web server host.
- nltk:path: Specifies the file stored in the NLTK data package at path. NLTK will search for these files in the directories specified by nltk.data.path.
If no protocol is specified, then the default protocol nltk: will be used.
This module provides two functions that can be used to access a resource file, given its URL: load() loads a given resource and adds it to a resource cache; retrieve() copies a given resource to a local file.
Class |
|
A GzipFile subclass for compatibility with older nltk releases. |
Class |
|
A path pointer that identifies a file which can be accessed directly via a given absolute path. |
Class |
|
A subclass of FileSystemPathPointer that identifies a gzip-compressed file located at a given absolute path. GzipFileSystemPathPointer is appropriate for loading large gzip-compressed pickle objects efficiently. |
Class |
|
Undocumented |
Class |
|
A subclass of zipfile.ZipFile that closes its file pointer whenever it is not using it; and re-opens it when it needs to read data from the zipfile. This is useful for reducing the number of open file handles when many zip files are being accessed at once... |
Class |
|
An abstract base class for 'path pointers,' used by NLTK's data package to identify specific paths. Two subclasses exist: FileSystemPathPointer identifies a file that can be accessed directly via a given absolute path... |
Class |
|
A stream reader that automatically decodes the source byte stream into unicode (like codecs.StreamReader); but still supports the seek() and tell() operations correctly. This is in contrast to codecs.StreamReader... |
Class |
|
A path pointer that identifies a file contained within a zipfile, which can be accessed by reading that zipfile. |
Function | clear |
Remove all objects from the resource cache. See load(). |
Function | find |
Find the given resource by searching through the directories and zip files in paths, where a None or empty string specifies an absolute path. Returns a corresponding path name. If the given resource is not found, raise a ... |
Function | gzip |
Undocumented |
Function | load |
Load a given resource from the NLTK data package. The following resource formats are currently supported: |
Function | normalize_resource_name |
No summary |
Function | normalize_resource_url |
Normalizes a resource url |
Function | retrieve |
Copy the given resource to a local file. If no filename is specified, then use the URL's filename. If there is already a file named filename, then raise a ValueError. |
Function | show |
Write out a grammar file, ignoring escaped and empty lines. |
Function | split_resource_url |
Splits a resource URL of the form "<protocol>:<path>" into a (protocol, path) pair. |
Constant | AUTO |
Undocumented |
Constant | FORMATS |
Undocumented |
Variable | path |
A list of directories where the NLTK data package might reside. These directories will be checked in order when looking for a resource in the data package. Note that this allows users to substitute in their own versions of resources, if they have them (e... |
Variable | textwrap |
Undocumented |
Function | _open |
Helper function that returns an open file object for a resource, given its resource URL. If the given resource URL uses the "nltk:" protocol, or uses no protocol, then use nltk.data.find to find its path, and open it with the given mode; if the resource URL uses the 'file' protocol, then open the file with the given mode; otherwise, delegate to ... |
Variable | _paths |
Undocumented |
Variable | _resource |
A dictionary used to cache resources so that they won't need to be loaded more than once. |
Find the given resource by searching through the directories and zip files in paths, where a None or empty string specifies an absolute path. Returns a corresponding path name. If the given resource is not found, raise a LookupError, whose message gives a pointer to the installation instructions for the NLTK downloader.
Zip File Handling:
- If resource_name contains a component with a .zip extension, then it is assumed to be a zipfile; and the remaining path components are used to look inside the zipfile.
- If any element of nltk.data.path has a .zip extension, then it is assumed to be a zipfile.
- If a given resource name that does not contain any zipfile component is not found initially, then find() will make a second attempt to find that resource, by replacing each component p in the path with p.zip/p. For example, this allows find() to map the resource name corpora/chat80/cities.pl to a zip file path pointer to corpora/chat80.zip/chat80/cities.pl.
- When using find() to locate a directory contained in a zipfile, the resource name must end with the forward slash character. Otherwise, find() will not locate the directory.
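The zip-file rules above can be illustrated with a small sketch. This is not NLTK's implementation: `zip_candidates` and `split_at_zip` are hypothetical helper names, and the sketch only manipulates resource names as strings, without touching the file system.

```python
def zip_candidates(resource_name):
    """Yield names formed by replacing one path component p with p.zip/p."""
    pieces = resource_name.split("/")
    for i, piece in enumerate(pieces):
        # Keep the components before position i, insert "<p>.zip", then
        # repeat component i and everything after it as the inner path.
        yield "/".join(pieces[:i] + [piece + ".zip"] + pieces[i:])


def split_at_zip(resource_name):
    """Split a name at its first '.zip' component into (zipfile, inner path)."""
    pieces = resource_name.split("/")
    for i, piece in enumerate(pieces):
        if piece.endswith(".zip"):
            return "/".join(pieces[: i + 1]), "/".join(pieces[i + 1:])
    # No zipfile component: the whole name is an ordinary path.
    return None, resource_name


candidates = list(zip_candidates("corpora/chat80/cities.pl"))
zip_part, inner_part = split_at_zip("corpora/chat80.zip/chat80/cities.pl")
print(candidates)
print(zip_part, inner_part)
```

One of the generated candidates is corpora/chat80.zip/chat80/cities.pl, the mapping described in the third rule.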
Parameters | |
resource | The name of the resource to search for. Resource names are posix-style relative path names, such as corpora/brown. Directory names will be automatically converted to a platform-appropriate path separator. |
paths | Undocumented |
Returns | |
str | Undocumented |
Undocumented
Load a given resource from the NLTK data package. The following resource formats are currently supported:
- pickle
- json
- yaml
- cfg (context free grammars)
- pcfg (probabilistic CFGs)
- fcfg (feature-based CFGs)
- fol (formulas of First Order Logic)
- logic (Logical formulas to be parsed by the given logic_parser)
- val (valuation of First Order Logic model)
- text (the file contents as a unicode string)
- raw (the raw file contents as a byte string)
If no format is specified, load() will attempt to determine a format based on the resource name's file extension. If that fails, load() will raise a ValueError exception.
For all text formats (everything except pickle, json, yaml and raw), load() first tries to decode the raw contents as UTF-8 and, if that fails, falls back to ISO-8859-1 (Latin-1), unless an encoding is specified explicitly.
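The format detection and decoding fallback described above might look roughly like the following. This is a sketch under stated assumptions, not the library's code: `EXTENSION_FORMATS`, `guess_format`, and `decode_text` are hypothetical names, and the extension table covers only a few of the listed formats.

```python
# Hypothetical extension-to-format table (illustrative subset only).
EXTENSION_FORMATS = {
    "pickle": "pickle",
    "json": "json",
    "yaml": "yaml",
    "cfg": "cfg",
    "pcfg": "pcfg",
    "fcfg": "fcfg",
    "txt": "text",
}


def guess_format(resource_name):
    """Return a format based on the file extension, or raise ValueError."""
    filename = resource_name.rsplit("/", 1)[-1]
    if "." in filename:
        extension = filename.rsplit(".", 1)[1]
        if extension in EXTENSION_FORMATS:
            return EXTENSION_FORMATS[extension]
    raise ValueError(f"Could not determine format for {resource_name!r}")


def decode_text(raw, encoding=None):
    """Decode bytes with the given encoding, else UTF-8, then Latin-1."""
    if encoding is not None:
        return raw.decode(encoding)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this cannot fail.
        return raw.decode("latin-1")


print(guess_format("grammars/toy.cfg"))
print(decode_text(b"caf\xe9"))
```

Because Latin-1 accepts any byte sequence, the fallback never raises, but it can silently produce wrong characters for files in other encodings; passing an explicit encoding avoids that.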
Parameters | |
resource | A URL specifying where the resource should be loaded from. The default protocol is "nltk:", which searches for the file in the NLTK data package. |
format | Undocumented |
cache:bool | If true, add this resource to a cache. If load() finds a resource in its cache, then it will return it from the cache rather than loading it. |
verbose:bool | If true, print a message when loading a resource. Messages are not displayed when a resource is retrieved from the cache. |
logic | The parser that will be used to parse logical expressions. |
fstruct | The parser that will be used to parse the feature structure of an fcfg. |
encoding:str | The encoding of the input; only used for text formats. |
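The effect of the cache parameter can be pictured with a dictionary-backed sketch. The names `cached_load`, `clear`, and `fake_loader` are hypothetical, and keying the cache by (URL, format) is an assumption made for this illustration.

```python
_cache = {}  # hypothetical stand-in for the module's resource cache


def cached_load(resource_url, fmt, loader, cache=True):
    """Return a cached value when allowed, otherwise call loader and store it."""
    key = (resource_url, fmt)
    if cache and key in _cache:
        return _cache[key]
    value = loader(resource_url)
    if cache:
        _cache[key] = value
    return value


def clear():
    """Remove every entry from the sketch cache."""
    _cache.clear()


calls = []  # records each time the loader actually runs


def fake_loader(url):
    calls.append(url)
    return {"url": url}


first = cached_load("nltk:corpora/abc/rural.txt", "text", fake_loader)
second = cached_load("nltk:corpora/abc/rural.txt", "text", fake_loader)
calls_before_clear = len(calls)  # the second call was served from the cache

clear()
third = cached_load("nltk:corpora/abc/rural.txt", "text", fake_loader)
calls_after_clear = len(calls)  # clearing forces a fresh load
```

Note that a cached object is shared between callers, so mutating it affects every later lookup until the cache is cleared.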
>>> windows = sys.platform.startswith('win')
>>> normalize_resource_name('.', True)
'./'
>>> normalize_resource_name('./', True)
'./'
>>> windows or normalize_resource_name('dir/file', False, '/') == '/dir/file'
True
>>> not windows or normalize_resource_name('C:/file', False, '/') == '/C:/file'
True
>>> windows or normalize_resource_name('/dir/file', False, '/') == '/dir/file'
True
>>> windows or normalize_resource_name('../dir/file', False, '/') == '/dir/file'
True
>>> not windows or normalize_resource_name('/dir/file', True, '/') == 'dir/file'
True
>>> windows or normalize_resource_name('/dir/file', True, '/') == '/dir/file'
True
Parameters | |
resource | The name of the resource to search for. Resource names are posix-style relative path names, such as corpora/brown. Directory names will automatically be converted to a platform-appropriate path separator. Trailing slashes on directory names are preserved. |
allow | Undocumented |
relative | Undocumented |
Normalizes a resource url
>>> windows = sys.platform.startswith('win')
>>> os.path.normpath(split_resource_url(normalize_resource_url('file:grammar.fcfg'))[1]) == \
... ('\\' if windows else '') + os.path.abspath(os.path.join(os.curdir, 'grammar.fcfg'))
True
>>> not windows or normalize_resource_url('file:C:/dir/file') == 'file:///C:/dir/file'
True
>>> not windows or normalize_resource_url('file:C:\\dir\\file') == 'file:///C:/dir/file'
True
>>> not windows or normalize_resource_url('file:C:\\dir/file') == 'file:///C:/dir/file'
True
>>> not windows or normalize_resource_url('file://C:/dir/file') == 'file:///C:/dir/file'
True
>>> not windows or normalize_resource_url('file:////C:/dir/file') == 'file:///C:/dir/file'
True
>>> not windows or normalize_resource_url('nltk:C:/dir/file') == 'file:///C:/dir/file'
True
>>> not windows or normalize_resource_url('nltk:C:\\dir\\file') == 'file:///C:/dir/file'
True
>>> windows or normalize_resource_url('file:/dir/file/toy.cfg') == 'file:///dir/file/toy.cfg'
True
>>> normalize_resource_url('nltk:home/nltk')
'nltk:home/nltk'
>>> windows or normalize_resource_url('nltk:/home/nltk') == 'file:///home/nltk'
True
>>> normalize_resource_url('http://example.com/dir/file')
'http://example.com/dir/file'
>>> normalize_resource_url('dir/file')
'nltk:dir/file'
Copy the given resource to a local file. If no filename is specified, then use the URL's filename. If there is already a file named filename, then raise a ValueError.
Parameters | |
resource | A URL specifying where the resource should be loaded from. The default protocol is "nltk:", which searches for the file in the NLTK data package. |
filename | Undocumented |
verbose | Undocumented |
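The copy-with-guard behaviour described for retrieve() can be sketched with the standard library alone. `copy_to_local` is a hypothetical helper that works on local paths rather than resource URLs, and the file names are illustrative.

```python
import os
import shutil
import tempfile


def copy_to_local(source_path, filename=None):
    """Copy source_path to filename, refusing to overwrite an existing file."""
    if filename is None:
        # Default to the last component of the source, as with a URL's filename.
        filename = os.path.basename(source_path)
    if os.path.exists(filename):
        raise ValueError(f"File {filename!r} already exists")
    shutil.copyfile(source_path, filename)
    return filename


with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "toy.cfg")
    with open(source, "w", encoding="utf-8") as handle:
        handle.write("S -> NP VP\n")

    target = os.path.join(tmp, "copy.cfg")
    copy_to_local(source, target)
    with open(target, encoding="utf-8") as handle:
        copied_text = handle.read()

    # A second copy to the same target must be rejected.
    try:
        copy_to_local(source, target)
        second_copy_failed = False
    except ValueError:
        second_copy_failed = True
```

Refusing to overwrite means a caller who wants a fresh copy has to delete or rename the existing file first.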
Write out a grammar file, ignoring escaped and empty lines.
Parameters | |
resource | A URL specifying where the resource should be loaded from. The default protocol is "nltk:", which searches for the file in the NLTK data package. |
escape:str | Prepended string that signals lines to be ignored |
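The filtering step can be sketched as below. `grammar_lines` is a hypothetical helper that returns the kept lines instead of printing them; the "##" escape prefix and the treatment of only zero-length lines as empty are assumptions made for this illustration.

```python
def grammar_lines(text, escape="##"):
    """Return the lines of text that are neither escaped nor empty."""
    kept = []
    for line in text.splitlines():
        if line.startswith(escape):
            continue  # escaped line, e.g. a comment
        if line == "":
            continue  # empty line
        kept.append(line)
    return kept


sample = "## Toy grammar\nS -> NP VP\n\n## lexicon\nNP -> 'Mary'\n"
visible = grammar_lines(sample)
for line in visible:
    print(line)
```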
Splits a resource URL of the form "<protocol>:<path>" into a (protocol, path) pair.
>>> windows = sys.platform.startswith('win')
>>> split_resource_url('nltk:home/nltk')
('nltk', 'home/nltk')
>>> split_resource_url('nltk:/home/nltk')
('nltk', '/home/nltk')
>>> split_resource_url('file:/home/nltk')
('file', '/home/nltk')
>>> split_resource_url('file:///home/nltk')
('file', '/home/nltk')
>>> split_resource_url('file:///C:/home/nltk')
('file', '/C:/home/nltk')
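A minimal function that reproduces the results in the examples above could look like this. `split_url` is a hypothetical name, the sketch assumes the URL already contains a protocol, and it does not attempt to cover protocols other than nltk and file.

```python
def split_url(resource_url):
    """Split '<protocol>:<path>' into a (protocol, path) tuple."""
    protocol, path = resource_url.split(":", 1)
    if protocol == "file" and path.startswith("/"):
        # Collapse any run of leading slashes ("/", "//", "///") into one.
        path = "/" + path.lstrip("/")
    return protocol, path


for url in ("nltk:home/nltk", "file:/home/nltk", "file:///C:/home/nltk"):
    print(url, "->", split_url(url))
```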
Undocumented
Value |
|
Undocumented
Value |
|
A list of directories where the NLTK data package might reside. These directories will be checked in order when looking for a resource in the data package. Note that this allows users to substitute in their own versions of resources, if they have them (e.g., in their home directory under ~/nltk_data).
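The in-order search is what lets a user's copy shadow a system-wide one. The sketch below shows the idea with two temporary directories; `find_in_paths` and the file name corpora/demo.txt are hypothetical, and zip-file handling is left out.

```python
import os
import tempfile


def find_in_paths(resource_name, search_paths):
    """Return the first existing match for resource_name, searching in order."""
    for directory in search_paths:
        candidate = os.path.join(directory, *resource_name.split("/"))
        if os.path.exists(candidate):
            return candidate
    raise LookupError(f"Resource {resource_name!r} not found")


with tempfile.TemporaryDirectory() as user_dir, \
        tempfile.TemporaryDirectory() as system_dir:
    # Put a file with the same resource name in both directories.
    for directory, text in ((user_dir, "user copy"), (system_dir, "system copy")):
        os.makedirs(os.path.join(directory, "corpora"))
        demo_path = os.path.join(directory, "corpora", "demo.txt")
        with open(demo_path, "w", encoding="utf-8") as handle:
            handle.write(text)

    # The user directory is listed first, so its copy wins.
    found = find_in_paths("corpora/demo.txt", [user_dir, system_dir])
    with open(found, encoding="utf-8") as handle:
        found_text = handle.read()

    try:
        find_in_paths("corpora/missing.txt", [user_dir, system_dir])
        missing_raised = False
    except LookupError:
        missing_raised = True
```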
Helper function that returns an open file object for a resource, given its resource URL. If the given resource URL uses the "nltk:" protocol, or uses no protocol, then use nltk.data.find to find its path, and open it with the given mode; if the resource URL uses the 'file' protocol, then open the file with the given mode; otherwise, delegate to urllib.request.urlopen (urllib2.urlopen in Python 2).
Parameters | |
resource | A URL specifying where the resource should be loaded from. The default protocol is "nltk:", which searches for the file in the NLTK data package. |
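The three-way dispatch performed by this helper can be summarised in a short sketch. `choose_opener` is a hypothetical function that only reports which strategy would apply; it opens nothing and performs no network access.

```python
def choose_opener(resource_url):
    """Name the strategy used to open resource_url, based on its protocol."""
    if ":" in resource_url:
        protocol = resource_url.split(":", 1)[0].lower()
    else:
        protocol = "nltk"  # no protocol given: fall back to the default
    if protocol == "nltk":
        return "find"  # locate the file in the data package, then open it
    if protocol == "file":
        return "open"  # open the local file directly
    return "urlopen"  # anything else is treated as a remote URL


for url in (
    "nltk:corpora/abc/rural.txt",
    "corpora/abc/rural.txt",
    "file:grammar.fcfg",
    "http://nltk.org/sample/toy.cfg",
):
    print(url, "->", choose_opener(url))
```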