nltk.downloader

module documentation

The NLTK corpus and module downloader. This module defines several interfaces which can be used to download corpora, models, and other data packages that can be used with NLTK.

Downloading Packages

If called with no arguments, download() displays an interactive interface which can be used to download and install new packages. If Tkinter is available, a graphical interface is shown; otherwise, a simple text interface is provided.

Individual packages can be downloaded by calling the download() function with a single argument, giving the package identifier for the package that should be downloaded:

>>> download('treebank') # doctest: +SKIP
[nltk_data] Downloading package 'treebank'...
[nltk_data]   Unzipping corpora/treebank.zip.

NLTK also provides a number of "package collections", each consisting of a group of related packages. To download all packages in a collection, call download() with the collection's identifier:

>>> download('all-corpora') # doctest: +SKIP
[nltk_data] Downloading package 'abc'...
[nltk_data]   Unzipping corpora/abc.zip.
[nltk_data] Downloading package 'alpino'...
[nltk_data]   Unzipping corpora/alpino.zip.
  ...
[nltk_data] Downloading package 'words'...
[nltk_data]   Unzipping corpora/words.zip.

Download Directory

By default, packages are installed either in a system-wide directory (if Python has sufficient access to write to it) or in the current user's home directory. However, the download_dir argument may be used to specify a different installation target.

See Downloader.default_download_dir() for a more detailed description of how the default download directory is chosen.
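As a minimal sketch of choosing a different installation target, the helper below passes download_dir to nltk.download(). The helper name fetch_package and its injectable download parameter are illustrative assumptions, not part of NLTK; NLTK itself is imported lazily, so the helper can also be exercised with a stand-in function.

```python
from pathlib import Path


def fetch_package(package_id, target_dir, download=None):
    """Download one package into target_dir (hypothetical helper)."""
    if download is None:
        # Imported lazily; this branch requires NLTK to be installed.
        import nltk
        download = nltk.download

    target = Path(target_dir).expanduser()
    target.mkdir(parents=True, exist_ok=True)
    # download_dir overrides the default installation directory.
    return download(package_id, download_dir=str(target))
```

For example, fetch_package('treebank', '~/my_nltk_data') would install the package under that (example) directory. NLTK searches only the directories listed in nltk.data.path, so a custom directory may need to be appended there before the data can be loaded.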

NLTK Download Server

Before downloading any packages, the corpus and module downloader contacts the NLTK download server, to retrieve an index file describing the available packages. By default, this index file is loaded from https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml. If necessary, it is possible to create a new Downloader object, specifying a different URL for the package index file.
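A rough sketch of pointing the downloader at a different index file follows. It assumes the Downloader constructor accepts server_index_url and download_dir keyword arguments; the helper name make_downloader is hypothetical.

```python
def make_downloader(index_url, download_dir=None):
    """Build a Downloader that reads its package index from index_url."""
    # Imported lazily; calling this function requires NLTK to be installed.
    from nltk.downloader import Downloader

    # server_index_url replaces the default index location.
    return Downloader(server_index_url=index_url, download_dir=download_dir)
```

A call such as make_downloader('https://example.org/nltk_data/index.xml').download('treebank') would then resolve 'treebank' against that index; the example.org URL is a placeholder, not a real mirror.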

Usage:

python nltk/downloader.py [-d DATADIR] [-q] [-f] [-k] PACKAGE_IDS

or:

python -m nltk.downloader [-d DATADIR] [-q] [-f] [-k] PACKAGE_IDS
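The same command line can be driven from a script. The sketch below only assembles the argument list from the usage string above, reading -d as the data directory, -q as quiet, and -f as force; the helper names are hypothetical, and the -k flag is left out because its meaning is not described here.

```python
import subprocess
import sys


def downloader_command(package_ids, data_dir=None, quiet=False, force=False):
    """Build the argv list for `python -m nltk.downloader` (hypothetical helper)."""
    cmd = [sys.executable, "-m", "nltk.downloader"]
    if data_dir is not None:
        cmd += ["-d", str(data_dir)]
    if quiet:
        cmd.append("-q")
    if force:
        cmd.append("-f")
    return cmd + list(package_ids)


def run_downloader(package_ids, **options):
    """Run the downloader in a subprocess; needs NLTK and network access."""
    return subprocess.run(downloader_command(package_ids, **options), check=True)
```

Keeping command construction separate from execution makes it possible to inspect or log the exact command before anything is downloaded.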

Class	`Collection`	A directory entry for a collection of downloadable packages. These entries are extracted from the XML index file that is downloaded by `Downloader`.
Class	`Downloader`	A class used to access the NLTK data server, which can be used to download corpora and other data packages.
Class	`DownloaderGUI`	Graphical interface for downloading packages from the NLTK data server.
Class	`DownloaderMessage`	A status message object, used by `incr_download` to communicate its progress.
Class	`DownloaderShell`	Undocumented
Class	`ErrorMessage`	Data server encountered an error
Class	`FinishCollectionMessage`	Data server has finished working on a collection of packages.
Class	`FinishDownloadMessage`	Data server has finished downloading a package.
Class	`FinishPackageMessage`	Data server has finished working on a package.
Class	`FinishUnzipMessage`	Data server has finished unzipping a package.
Class	`Package`	A directory entry for a downloadable package. These entries are extracted from the XML index file that is downloaded by `Downloader`. Each package consists of a single file; but if that file is a zip file, then it can be automatically decompressed when the package is installed.
Class	`ProgressMessage`	Indicates how much progress the data server has made
Class	`SelectDownloadDirMessage`	Indicates what download directory the data server is using
Class	`StaleMessage`	The package download file is out-of-date or corrupt
Class	`StartCollectionMessage`	Data server has started working on a collection of packages.
Class	`StartDownloadMessage`	Data server has started downloading a package.
Class	`StartPackageMessage`	Data server has started working on a package.
Class	`StartUnzipMessage`	Data server has started unzipping a package.
Class	`UpToDateMessage`	The package download file is already up-to-date
Function	`build_index`	Create a new data.xml index file, by combining the xml description files for various packages and collections. `root` should be the path to a directory containing the package xml and zip files; and the collection xml files...
Function	`download_gui`	Undocumented
Function	`download_shell`	Undocumented
Function	`md5_hexdigest`	Calculate and return the MD5 checksum for a given file. `file` may either be a filename or an open stream.
Function	`unzip`	Extract the contents of the zip file `filename` into the directory `root`.
Function	`update`	Undocumented
Variable	`TKINTER`	Undocumented
Function	`_check_package`	Helper for `build_index()`: Perform some checks to make sure that the given package is consistent.
Function	`_find_collections`	Helper for `build_index()`: Yield a list of ElementTree.Element objects, each holding the xml for a single package collection.
Function	`_find_packages`	Helper for `build_index()`: Yield a list of tuples `(pkg_xml, zf, subdir)`, where:
Function	`_indent_xml`	Helper for `build_index()`: Given an XML `ElementTree`, modify the `text` and `tail` attributes of it and its descendants to generate an indented tree, where each nested element is indented by 2 spaces with respect to its parent.
Function	`_md5_hexdigest`	Undocumented
Function	`_svn_revision`	Helper for `build_index()`: Calculate the subversion revision number for a given file (by using `subprocess` to run `svn`).
Function	`_unzip_iter`	Undocumented
Variable	`_downloader`	Undocumented

def build_index(root, base_url):

Create a new data.xml index file, by combining the xml description files for various packages and collections. root should be the path to a directory containing the package xml and zip files; and the collection xml files. The root directory is expected to have the following subdirectories:

root/
  packages/ .................. subdirectory for packages
    corpora/ ................. zip & xml files for corpora
    grammars/ ................ zip & xml files for grammars
    taggers/ ................. zip & xml files for taggers
    tokenizers/ .............. zip & xml files for tokenizers
    etc.
  collections/ ............... xml files for collections

For each package, there should be two files: package.zip (where package is the package name) which contains the package itself as a compressed zip file; and package.xml, which is an xml description of the package. The zipfile package.zip should expand to a single subdirectory named package/. The base filename package must match the identifier given in the package's xml file.

For each collection, there should be a single file collection.xml describing the collection, where collection is the name of the collection.

All identifiers (for both packages and collections) must be unique.
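The layout rules above can be checked before an index is built. The following is an illustrative sketch, not NLTK's own validation: it assumes each xml file stores its identifier in an id attribute on the root element, and the function name check_layout is hypothetical.

```python
import xml.etree.ElementTree as ET
import zipfile
from pathlib import Path


def check_layout(root):
    """Return a list of problems found under root (illustrative only)."""
    root = Path(root)
    problems = []
    seen_ids = set()

    def register(identifier, xml_path):
        # Identifiers must be unique across packages and collections.
        if identifier in seen_ids:
            problems.append(f"{xml_path}: duplicate id {identifier!r}")
        seen_ids.add(identifier)

    for xml_path in sorted((root / "packages").glob("*/*.xml")):
        identifier = ET.parse(xml_path).getroot().get("id")
        register(identifier, xml_path)
        # The base filename must match the identifier in the xml file.
        if identifier != xml_path.stem:
            problems.append(f"{xml_path}: id {identifier!r} does not match file name")
        zip_path = xml_path.with_suffix(".zip")
        if not zip_path.is_file():
            problems.append(f"{zip_path}: missing zip file")
            continue
        with zipfile.ZipFile(zip_path) as zf:
            # Every member should live under the single directory package/.
            prefix = xml_path.stem + "/"
            if not all(name.startswith(prefix) for name in zf.namelist()):
                problems.append(f"{zip_path}: does not expand to {prefix}")

    for xml_path in sorted((root / "collections").glob("*.xml")):
        register(ET.parse(xml_path).getroot().get("id"), xml_path)

    return problems
```

An empty result means the directory satisfies the rules this sketch checks; it does not guarantee that build_index() itself will accept it.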

def download_gui():

Undocumented

def download_shell():

Undocumented

def md5_hexdigest(file):

Calculate and return the MD5 checksum for a given file. file may either be a filename or an open stream.
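To illustrate the "filename or open stream" behaviour, here is a standalone sketch built on hashlib. It is not NLTK's implementation; the name md5_of and the block_size parameter are assumptions made for the example.

```python
import hashlib


def md5_of(file, block_size=1 << 16):
    """Return the hex MD5 digest of a filename or an open binary stream."""
    if isinstance(file, (str, bytes)) or hasattr(file, "__fspath__"):
        # A path was given: open it in binary mode and hash the stream.
        with open(file, "rb") as stream:
            return md5_of(stream, block_size)
    digest = hashlib.md5()
    # Read fixed-size blocks so large files are never fully held in memory.
    for block in iter(lambda: file.read(block_size), b""):
        digest.update(block)
    return digest.hexdigest()
```

A checksum of this kind can be compared with the value recorded for a package to decide whether a downloaded file is stale or corrupt.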

def unzip(filename, root, verbose=True):

Extract the contents of the zip file filename into the directory root.
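The extraction step can be approximated with the standard zipfile module. This is a simplified sketch rather than NLTK's code: the name extract_zip is hypothetical, and no progress messages are printed.

```python
import zipfile
from pathlib import Path


def extract_zip(filename, root):
    """Extract every member of the zip file into root and return their names."""
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(filename) as zf:
        zf.extractall(root)
        return zf.namelist()
```

For a package archive that follows the layout described for build_index(), the extracted files end up in a single subdirectory of root named after the package.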

def update():

Undocumented

TKINTER: bool

Undocumented

def _check_package(pkg_xml, zipfilename, zf):

Helper for build_index(): Perform some checks to make sure that the given package is consistent.

def _find_collections(root):

Helper for build_index(): Yield a list of ElementTree.Element objects, each holding the xml for a single package collection.

def _find_packages(root):

Helper for build_index(): Yield a list of tuples (pkg_xml, zf, subdir), where:

pkg_xml is an ElementTree.Element holding the xml for a package.

zf is a zipfile.ZipFile for the package's contents.

subdir is the subdirectory (relative to root) where the package was found (e.g. 'corpora' or 'grammars').

def _indent_xml(xml, prefix=''):

Helper for build_index(): Given an XML ElementTree, modify the text and tail attributes of it and its descendants to generate an indented tree, where each nested element is indented by 2 spaces with respect to its parent.
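The effect on text and tail can be seen in a small standalone version. This is a sketch of the technique, not NLTK's code; it assumes elements with children carry no meaningful text of their own, and the name indent_xml is hypothetical.

```python
import xml.etree.ElementTree as ET


def indent_xml(elem, prefix=""):
    """Indent elem in place, two spaces per nesting level."""
    children = list(elem)
    if not children:
        return
    child_prefix = prefix + "  "
    # Whitespace before the first child goes into the parent's text.
    elem.text = "\n" + child_prefix
    for child in children:
        indent_xml(child, child_prefix)
        # Whitespace before the next sibling goes into this child's tail.
        child.tail = "\n" + child_prefix
    # The last child is followed by the parent's closing tag, one level out.
    children[-1].tail = "\n" + prefix


root = ET.fromstring("<index><packages><package/></packages><collections/></index>")
indent_xml(root)
print(ET.tostring(root, encoding="unicode"))
```

Python 3.9 and later also provide xml.etree.ElementTree.indent(), which performs a similar in-place indentation.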

def _md5_hexdigest(fp):

Undocumented

def _svn_revision(filename):

Helper for build_index(): Calculate the subversion revision number for a given file (by using subprocess to run svn).

def _unzip_iter(filename, root, verbose=True):

Undocumented

_downloader

Undocumented