module documentation

Undocumented

Class Index Undocumented
Function acyclic_branches_depth_first Traverse the nodes of a tree in depth-first order, discarding any cycles within the same branch, but keeping duplicate paths in different branches. Add cut_mark (when defined) if cycles were truncated.
Function acyclic_breadth_first Traverse the nodes of a tree in breadth-first order, discarding any cycles.
Function acyclic_depth_first Traverse the nodes of a tree in depth-first order, discarding any cycles within any branch, adding cut_mark (when specified) if cycles were truncated.
Function acyclic_dic2tree Convert acyclic dictionary 'dic', where the keys are nodes, and the values are lists of children, to output tree suitable for pprint(), starting at root 'node', with subtrees as nested lists.
Function bigrams Return the bigrams generated from a sequence of items, as an iterator. For example:
Function binary_search_file Return the line from the file with first word key. Searches through a sorted file using the binary search algorithm.
Function breadth_first Traverse the nodes of a tree in breadth-first order. (No check for cycles.) The first argument should be the tree root; children should be a function taking as argument a tree node and returning an iterator of the node's children.
Function choose This function is a fast way to calculate binomial coefficients, commonly known as nCk, i.e. the number of combinations of n things taken k at a time. (https://en.wikipedia.org/wiki/Binomial_coefficient...
Function clean_html Undocumented
Function clean_url Undocumented
Function elementtree_indent Recursive function to indent an ElementTree._ElementInterface used for pretty printing. Run indent on elem and then output in the normal way.
Function everygrams Returns all possible ngrams generated from a sequence of items, as an iterator.
Function filestring Undocumented
Function flatten Flatten a list.
Function guess_encoding Given a byte string, attempt to decode it. Tries the standard 'UTF8' and 'latin-1' encodings, plus several gathered from locale information.
Function in_idle Return True if this function is run within idle. Tkinter programs that are run in idle should never call Tk.mainloop; so this function should be used to gate all calls to Tk.mainloop.
Function invert_dict Undocumented
Function invert_graph Inverts a directed graph.
Function ngrams Return the ngrams generated from a sequence of items, as an iterator. For example:
Function pad_sequence Returns a padded sequence of items before ngram extraction.
Function pairwise s -> (s0,s1), (s1,s2), (s2, s3), ...
Function parallelize_preprocess Undocumented
Function pr Pretty print a sequence of data items
Function print_string Pretty print a string, breaking lines on whitespace
Function py25 Undocumented
Function re_show Return a string with markers surrounding the matched substrings. Search string for substrings matching regexp and wrap the matches with braces. This is convenient for learning about regular expressions.
Function set_proxy Set the HTTP proxy for Python to download through.
Function skipgrams Returns all possible skipgrams generated from a sequence of items, as an iterator. Skipgrams are ngrams that allow tokens to be skipped. Refer to http://homepages.inf.ed.ac.uk/ballison/pdf/lrec_skipgrams.pdf...
Function tokenwrap Pretty print a list of text tokens, breaking lines on whitespace
Function transitive_closure Calculate the transitive closure of a directed graph, optionally the reflexive transitive closure.
Function trigrams Return the trigrams generated from a sequence of items, as an iterator. For example:
Function unique_list Undocumented
Function unweighted_minimum_spanning_tree Output a Minimum Spanning Tree (MST) of an unweighted graph, by traversing the nodes of a tree in breadth-first order, discarding any cycles.
Function usage Undocumented
def acyclic_branches_depth_first(tree, children=iter, depth=-1, cut_mark=None, traversed=None): (source)

Traverse the nodes of a tree in depth-first order, discarding any cycles within the same branch, but keeping duplicate paths in different branches. Add cut_mark (when defined) if cycles were truncated.

The first argument should be the tree root; children should be a function taking as argument a tree node and returning an iterator of the node's children.

Catches only cycles within the same branch, while keeping cycles from different branches:

>>> import nltk
>>> from nltk.util import acyclic_branches_depth_first as tree
>>> wn=nltk.corpus.wordnet
>>> from pprint import pprint
>>> pprint(tree(wn.synset('certified.a.01'), lambda s:s.also_sees(), cut_mark='...', depth=4))
[Synset('certified.a.01'),
 [Synset('authorized.a.01'),
  [Synset('lawful.a.01'),
   [Synset('legal.a.01'),
    "Cycle(Synset('lawful.a.01'),0,...)",
    [Synset('legitimate.a.01'), '...']],
   [Synset('straight.a.06'),
    [Synset('honest.a.01'), '...'],
    "Cycle(Synset('lawful.a.01'),0,...)"]],
  [Synset('legitimate.a.01'),
   "Cycle(Synset('authorized.a.01'),1,...)",
   [Synset('legal.a.01'),
    [Synset('lawful.a.01'), '...'],
    "Cycle(Synset('legitimate.a.01'),0,...)"],
   [Synset('valid.a.01'),
    "Cycle(Synset('legitimate.a.01'),0,...)",
    [Synset('reasonable.a.01'), '...']]],
  [Synset('official.a.01'), "Cycle(Synset('authorized.a.01'),1,...)"]],
 [Synset('documented.a.01')]]
def acyclic_breadth_first(tree, children=iter, maxdepth=-1): (source)

Traverse the nodes of a tree in breadth-first order, discarding any cycles.

The first argument should be the tree root; children should be a function taking as argument a tree node and returning an iterator of the node's children.
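The traversal described above can be sketched in plain Python. This is a minimal illustration, not NLTK's implementation: the helper name bfs_acyclic and the sample graph are hypothetical, and it assumes that maxdepth=-1 means "no depth limit".

```python
from collections import deque


def bfs_acyclic(root, children, maxdepth=-1):
    """Yield nodes in breadth-first order, visiting each node at most once.

    Hypothetical helper for illustration; not part of nltk.util.
    """
    seen = {root}
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        yield node
        # Assumption: maxdepth=-1 never equals a real depth, so no limit applies.
        if depth != maxdepth:
            for child in children(node):
                if child not in seen:  # skipping seen nodes discards cycles
                    seen.add(child)
                    queue.append((child, depth + 1))


# A small graph with a cycle (b -> a) and a shared child (d).
graph = {"a": ["b", "c"], "b": ["d", "a"], "c": ["d"], "d": []}
order = list(bfs_acyclic("a", lambda n: graph[n]))
shallow = list(bfs_acyclic("a", lambda n: graph[n], maxdepth=1))
print(order)    # ['a', 'b', 'c', 'd']
print(shallow)  # ['a', 'b', 'c']
```

Here children is an ordinary function from a node to an iterable of its children, matching the calling convention described above.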

def acyclic_depth_first(tree, children=iter, depth=-1, cut_mark=None, traversed=None): (source)

Traverse the nodes of a tree in depth-first order, discarding any cycles within any branch, adding cut_mark (when specified) if cycles were truncated.

The first argument should be the tree root; children should be a function taking as argument a tree node and returning an iterator of the node's children.

Catches all cycles:

>>> import nltk
>>> from nltk.util import acyclic_depth_first as acyclic_tree
>>> wn=nltk.corpus.wordnet
>>> from pprint import pprint
>>> pprint(acyclic_tree(wn.synset('dog.n.01'), lambda s:s.hypernyms(),cut_mark='...'))
[Synset('dog.n.01'),
 [Synset('canine.n.02'),
  [Synset('carnivore.n.01'),
   [Synset('placental.n.01'),
    [Synset('mammal.n.01'),
     [Synset('vertebrate.n.01'),
      [Synset('chordate.n.01'),
       [Synset('animal.n.01'),
        [Synset('organism.n.01'),
         [Synset('living_thing.n.01'),
          [Synset('whole.n.02'),
           [Synset('object.n.01'),
            [Synset('physical_entity.n.01'),
             [Synset('entity.n.01')]]]]]]]]]]]]],
 [Synset('domestic_animal.n.01'), "Cycle(Synset('animal.n.01'),-3,...)"]]
def acyclic_dic2tree(node, dic): (source)

Convert acyclic dictionary 'dic', where the keys are nodes, and the values are lists of children, to output tree suitable for pprint(), starting at root 'node', with subtrees as nested lists.
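As a rough illustration of the conversion described above, the following sketch builds the nested-list tree recursively. The name dic2tree_sketch is hypothetical, and treating a node that is missing from the dictionary as a leaf is an assumption made here for convenience.

```python
from pprint import pprint


def dic2tree_sketch(node, dic):
    """Return [node, subtree, subtree, ...] for an acyclic parent -> children dict.

    Hypothetical helper for illustration; assumes nodes absent from `dic` are leaves.
    """
    return [node] + [dic2tree_sketch(child, dic) for child in dic.get(node, [])]


dic = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
tree = dic2tree_sketch("a", dic)
pprint(tree)  # ['a', ['b', ['d']], ['c']]
```

The input must be acyclic; a cycle in the dictionary would make this recursion loop forever.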

def bigrams(sequence, **kwargs): (source)

Return the bigrams generated from a sequence of items, as an iterator. For example:

>>> from nltk.util import bigrams
>>> list(bigrams([1,2,3,4,5]))
[(1, 2), (2, 3), (3, 4), (4, 5)]

Wrap with list for a list version of this function.

Parameters
sequence (sequence or iter): the source data to be converted into bigrams
**kwargs: Undocumented
Returns
iter(tuple): Undocumented
def binary_search_file(file, key, cache={}, cacheDepth=-1): (source)

Return the line from the file with first word key. Searches through a sorted file using the binary search algorithm.

Parameters
file (file): the file to be searched through.
key (str): the identifier we are searching for.
cache: Undocumented
cacheDepth: Undocumented
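The binary search idea can be illustrated on an in-memory list of lines rather than on file offsets. This is only a sketch under stated assumptions: the helper name find_line is hypothetical, the lines are assumed to be sorted by their first word, and the caching parameters are not modelled.

```python
def find_line(sorted_lines, key):
    """Return the line whose first word equals `key`, or None if there is none.

    Hypothetical helper; assumes `sorted_lines` is sorted by first word.
    """
    lo, hi = 0, len(sorted_lines)
    while lo < hi:
        mid = (lo + hi) // 2
        first_word = sorted_lines[mid].split(" ", 1)[0]
        if first_word == key:
            return sorted_lines[mid]
        if first_word < key:
            lo = mid + 1  # key can only be in the upper half
        else:
            hi = mid      # key can only be in the lower half
    return None


lines = ["apple 1", "banana 2", "cherry 3", "date 4"]
print(find_line(lines, "banana"))  # banana 2
print(find_line(lines, "zebra"))   # None
```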
def breadth_first(tree, children=iter, maxdepth=-1): (source)

Traverse the nodes of a tree in breadth-first order. (No check for cycles.) The first argument should be the tree root; children should be a function taking as argument a tree node and returning an iterator of the node's children.

def choose(n, k): (source)

This function is a fast way to calculate binomial coefficients, commonly known as nCk, i.e. the number of combinations of n things taken k at a time. (https://en.wikipedia.org/wiki/Binomial_coefficient).

This computes the same result as scipy.special.comb() with long integer computation, but is faster; see https://github.com/nltk/nltk/issues/1181

>>> choose(4, 2)
6
>>> choose(6, 2)
15
Parameters
n (int): The number of things.
k (int): The number of times a thing is taken.
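One common way to compute nCk with integers only is the multiplicative formula, sketched below. This is an illustration rather than NLTK's code: choose_sketch is a hypothetical name, and returning 0 when k is out of range is an assumption. The standard library's math.comb (Python 3.8+) is used as a cross-check.

```python
import math


def choose_sketch(n, k):
    """Binomial coefficient nCk via the multiplicative formula (hypothetical helper)."""
    if not 0 <= k <= n:
        return 0  # assumption: out-of-range k yields 0
    k = min(k, n - k)  # nCk == nC(n-k); use the smaller one for fewer steps
    result = 1
    for i in range(1, k + 1):
        # After step i, result equals C(n - k + i, i), so the division is exact.
        result = result * (n - k + i) // i
    return result


print(choose_sketch(4, 2))  # 6
print(choose_sketch(6, 2))  # 15
print(choose_sketch(6, 2) == math.comb(6, 2))  # True
```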
def clean_html(html): (source)

Undocumented

def clean_url(url): (source)

Undocumented

def elementtree_indent(elem, level=0): (source)

Recursive function to indent an ElementTree._ElementInterface used for pretty printing. Run indent on elem and then output in the normal way.

Parameters
elem (ElementTree._ElementInterface): element to be indented. will be modified.
level (nonnegative integer): level of indentation for this element
Returns
ElementTree._ElementInterface: Contents of elem indented to reflect its structure
def everygrams(sequence, min_len=1, max_len=-1, pad_left=False, pad_right=False, **kwargs): (source)

Returns all possible ngrams generated from a sequence of items, as an iterator.

>>> sent = 'a b c'.split()
New version outputs for everygrams.
>>> list(everygrams(sent))
[('a',), ('a', 'b'), ('a', 'b', 'c'), ('b',), ('b', 'c'), ('c',)]
Old version outputs for everygrams.
>>> sorted(everygrams(sent), key=len)
[('a',), ('b',), ('c',), ('a', 'b'), ('b', 'c'), ('a', 'b', 'c')]
>>> list(everygrams(sent, max_len=2))
[('a',), ('a', 'b'), ('b',), ('b', 'c'), ('c',)]
Parameters
sequence (sequence or iter): the source data to be converted into ngrams. If max_len is not provided, this sequence will be loaded into memory
min_len (int): minimum length of the ngrams, aka. n-gram order/degree of ngram
max_len (int): maximum length of the ngrams (set to length of sequence by default)
pad_left (bool): whether the ngrams should be left-padded
pad_right (bool): whether the ngrams should be right-padded
**kwargs: Undocumented
Returns
iter(tuple): Undocumented
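The "new version" ordering shown in the examples above (all ngrams starting at the first item, then all starting at the second, and so on) can be reproduced with a short sketch. The name everygrams_sketch is hypothetical, padding is not modelled, and the input is loaded into a list for simplicity.

```python
def everygrams_sketch(sequence, min_len=1, max_len=-1):
    """Yield every ngram of length min_len..max_len, grouped by start position.

    Hypothetical helper for illustration; padding options are omitted.
    """
    tokens = list(sequence)
    if max_len == -1:
        max_len = len(tokens)  # default: up to the full sequence length
    for start in range(len(tokens)):
        for length in range(min_len, max_len + 1):
            if start + length <= len(tokens):
                yield tuple(tokens[start:start + length])


sent = "a b c".split()
all_grams = list(everygrams_sketch(sent))
short_grams = list(everygrams_sketch(sent, max_len=2))
print(all_grams)
# [('a',), ('a', 'b'), ('a', 'b', 'c'), ('b',), ('b', 'c'), ('c',)]
print(short_grams)
# [('a',), ('a', 'b'), ('b',), ('b', 'c'), ('c',)]
```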
def filestring(f): (source)

Undocumented

def flatten(*args): (source)

Flatten a list.

>>> from nltk.util import flatten
>>> flatten(1, 2, ['b', 'a' , ['c', 'd']], 3)
[1, 2, 'b', 'a', 'c', 'd', 3]
Parameters
*args: items and lists to be combined into a single list
Returns
list: Undocumented
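A recursive flattening routine along these lines might look as follows. It is a sketch, not the library source: flatten_sketch is a hypothetical name, and treating tuples the same way as lists is an assumption.

```python
def flatten_sketch(*args):
    """Combine items and (possibly nested) lists into one flat list.

    Hypothetical helper; assumes tuples should be flattened like lists.
    """
    flat = []
    for item in args:
        if isinstance(item, (list, tuple)):
            flat.extend(flatten_sketch(*item))  # recurse into nested containers
        else:
            flat.append(item)
    return flat


result = flatten_sketch(1, 2, ["b", "a", ["c", "d"]], 3)
print(result)  # [1, 2, 'b', 'a', 'c', 'd', 3]
```

Strings are appended whole rather than split into characters, because only lists and tuples are treated as containers.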
def guess_encoding(data): (source)

Given a byte string, attempt to decode it. Tries the standard 'UTF8' and 'latin-1' encodings, plus several gathered from locale information.

The calling program must first call:

locale.setlocale(locale.LC_ALL, '')

If successful it returns (decoded_unicode, successful_encoding). If unsuccessful it raises a UnicodeError.
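The try-each-encoding strategy can be sketched as below. The helper name guess_encoding_sketch, the exact candidate list, and its order are assumptions for illustration; the real function gathers more candidates from the locale.

```python
import locale


def guess_encoding_sketch(data):
    """Return (decoded_text, encoding) for the first candidate that decodes `data`.

    Hypothetical helper; the candidate list and its order are assumptions.
    """
    candidates = ["utf-8"]
    preferred = locale.getpreferredencoding(False)
    if preferred:
        candidates.append(preferred)
    candidates.append("latin-1")  # accepts any byte sequence, so keep it last
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except (UnicodeError, LookupError):
            continue
    raise UnicodeError("unable to decode input with: " + ", ".join(candidates))


text, encoding = guess_encoding_sketch("caf\u00e9".encode("utf-8"))
print(encoding)  # utf-8
```

Because 'latin-1' maps every byte to a character, placing it last means the loop only falls back to it when the stricter encodings fail.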

def in_idle(): (source)

Return True if this function is run within idle. Tkinter programs that are run in idle should never call Tk.mainloop; so this function should be used to gate all calls to Tk.mainloop.

Returns
bool: Undocumented
Warning
This function works by checking sys.stdin. If the user has modified sys.stdin, then it may return incorrect results.
def invert_dict(d): (source)

Undocumented

def invert_graph(graph): (source)

Inverts a directed graph.

Parameters
graph (dict(set)): the graph, represented as a dictionary of sets
Returns
dict(set): the inverted graph
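Inverting a graph in this dictionary-of-sets representation means turning every edge source -> target into target -> source. A minimal sketch, with the hypothetical name invert_graph_sketch:

```python
def invert_graph_sketch(graph):
    """Reverse every edge of a directed graph given as {node: set_of_targets}.

    Hypothetical helper for illustration.
    """
    inverted = {}
    for source, targets in graph.items():
        for target in targets:
            inverted.setdefault(target, set()).add(source)
    return inverted


graph = {"a": {"b", "c"}, "b": {"c"}}
inverted = invert_graph_sketch(graph)
print(inverted == {"b": {"a"}, "c": {"a", "b"}})  # True
```

In this sketch, nodes with no incoming edges (here "a") do not appear as keys in the result.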
def ngrams(sequence, n, **kwargs): (source)

Return the ngrams generated from a sequence of items, as an iterator. For example:

>>> from nltk.util import ngrams
>>> list(ngrams([1,2,3,4,5], 3))
[(1, 2, 3), (2, 3, 4), (3, 4, 5)]

Wrap with list for a list version of this function. Set pad_left or pad_right to true in order to get additional ngrams:

>>> list(ngrams([1,2,3,4,5], 2, pad_right=True))
[(1, 2), (2, 3), (3, 4), (4, 5), (5, None)]
>>> list(ngrams([1,2,3,4,5], 2, pad_right=True, right_pad_symbol='</s>'))
[(1, 2), (2, 3), (3, 4), (4, 5), (5, '</s>')]
>>> list(ngrams([1,2,3,4,5], 2, pad_left=True, left_pad_symbol='<s>'))
[('<s>', 1), (1, 2), (2, 3), (3, 4), (4, 5)]
>>> list(ngrams([1,2,3,4,5], 2, pad_left=True, pad_right=True, left_pad_symbol='<s>', right_pad_symbol='</s>'))
[('<s>', 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, '</s>')]
Parameters
sequence (sequence or iter): the source data to be converted into ngrams
n (int): the degree of the ngrams
pad_left (bool): whether the ngrams should be left-padded
pad_right (bool): whether the ngrams should be right-padded
left_pad_symbol (any): the symbol to use for left padding (default is None)
right_pad_symbol (any): the symbol to use for right padding (default is None)
**kwargs: Undocumented
Returns
sequence or iter: Undocumented
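The sliding-window behaviour and the padding options can be illustrated with a lazy sketch built on itertools. The name ngrams_sketch is hypothetical, and padding with n - 1 symbols on each requested side is an assumption chosen to reproduce the example outputs above.

```python
from itertools import chain, tee


def ngrams_sketch(sequence, n, pad_left=False, pad_right=False,
                  left_pad_symbol=None, right_pad_symbol=None):
    """Lazily yield n-item windows over `sequence` (hypothetical helper)."""
    items = iter(sequence)
    # Assumption: pad with n - 1 symbols on each requested side.
    if pad_left:
        items = chain([left_pad_symbol] * (n - 1), items)
    if pad_right:
        items = chain(items, [right_pad_symbol] * (n - 1))
    copies = tee(items, n)
    for offset, copy in enumerate(copies):
        for _ in range(offset):
            next(copy, None)  # advance the i-th copy by i items
    return zip(*copies)


trigram_list = list(ngrams_sketch([1, 2, 3, 4, 5], 3))
padded = list(ngrams_sketch([1, 2, 3, 4, 5], 2, pad_left=True, pad_right=True,
                            left_pad_symbol="<s>", right_pad_symbol="</s>"))
print(trigram_list)  # [(1, 2, 3), (2, 3, 4), (3, 4, 5)]
print(padded)
# [('<s>', 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, '</s>')]
```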
def pad_sequence(sequence, n, pad_left=False, pad_right=False, left_pad_symbol=None, right_pad_symbol=None): (source)

Returns a padded sequence of items before ngram extraction.

>>> list(pad_sequence([1,2,3,4,5], 2, pad_left=True, pad_right=True, left_pad_symbol='<s>', right_pad_symbol='</s>'))
['<s>', 1, 2, 3, 4, 5, '</s>']
>>> list(pad_sequence([1,2,3,4,5], 2, pad_left=True, left_pad_symbol='<s>'))
['<s>', 1, 2, 3, 4, 5]
>>> list(pad_sequence([1,2,3,4,5], 2, pad_right=True, right_pad_symbol='</s>'))
[1, 2, 3, 4, 5, '</s>']
Parameters
sequence (sequence or iter): the source data to be padded
n (int): the degree of the ngrams
pad_left (bool): whether the ngrams should be left-padded
pad_right (bool): whether the ngrams should be right-padded
left_pad_symbol (any): the symbol to use for left padding (default is None)
right_pad_symbol (any): the symbol to use for right padding (default is None)
Returns
sequence or iter: Undocumented
def pairwise(iterable): (source)

s -> (s0,s1), (s1,s2), (s2, s3), ...
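The mapping above is the classic itertools recipe for overlapping pairs. A minimal sketch, using the hypothetical name pairwise_sketch:

```python
from itertools import tee


def pairwise_sketch(iterable):
    """s -> (s0, s1), (s1, s2), (s2, s3), ... (hypothetical helper)."""
    first, second = tee(iterable)
    next(second, None)  # shift the second copy forward by one item
    return zip(first, second)


pairs = list(pairwise_sketch([1, 2, 3, 4]))
print(pairs)  # [(1, 2), (2, 3), (3, 4)]
```

An input with fewer than two items produces no pairs.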

def parallelize_preprocess(func, iterator, processes, progress_bar=False): (source)

Undocumented

def pr(data, start=0, end=None): (source)

Pretty print a sequence of data items

Parameters
data (sequence or iter): the data stream to print
start (int): the start position
end (int): the end position
def print_string(s, width=70): (source)

Pretty print a string, breaking lines on whitespace

Parameters
s (str): the string to print, consisting of words and spaces
width (int): the display width
def py25(): (source)

Undocumented

def re_show(regexp, string, left='{', right='}'): (source)

Return a string with markers surrounding the matched substrings. Search string for substrings matching regexp and wrap the matches with braces. This is convenient for learning about regular expressions.

Parameters
regexp (str): The regular expression.
string (str): The string being matched.
left (str): The left delimiter (printed before the matched substring)
right (str): The right delimiter (printed after the matched substring)
Returns
str: Undocumented
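Wrapping each match in delimiters can be done with re.sub and a replacement function. The sketch below uses the hypothetical name mark_matches and returns the marked string; it is an illustration, not the library's implementation.

```python
import re


def mark_matches(regexp, string, left="{", right="}"):
    """Return `string` with every match of `regexp` wrapped in left/right markers.

    Hypothetical helper; a callable replacement avoids escaping the delimiters.
    """
    return re.sub(regexp, lambda match: left + match.group(0) + right, string)


marked = mark_matches(r"[aeiou]", "regular")
print(marked)  # r{e}g{u}l{a}r
print(mark_matches(r"\d+", "call 555 or 1234", left="<", right=">"))
# call <555> or <1234>
```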
def set_proxy(proxy, user=None, password=''): (source)

Set the HTTP proxy for Python to download through.

If proxy is None then tries to set proxy from environment or system settings.

Parameters
proxy: The HTTP proxy server to use. For example: 'http://proxy.example.com:3128/'
user: The username to authenticate with. Use None to disable authentication.
password: The password to authenticate with.
def skipgrams(sequence, n, k, **kwargs): (source)

Returns all possible skipgrams generated from a sequence of items, as an iterator. Skipgrams are ngrams that allow tokens to be skipped. Refer to http://homepages.inf.ed.ac.uk/ballison/pdf/lrec_skipgrams.pdf

>>> sent = "Insurgents killed in ongoing fighting".split()
>>> list(skipgrams(sent, 2, 2))
[('Insurgents', 'killed'), ('Insurgents', 'in'), ('Insurgents', 'ongoing'), ('killed', 'in'), ('killed', 'ongoing'), ('killed', 'fighting'), ('in', 'ongoing'), ('in', 'fighting'), ('ongoing', 'fighting')]
>>> list(skipgrams(sent, 3, 2))
[('Insurgents', 'killed', 'in'), ('Insurgents', 'killed', 'ongoing'), ('Insurgents', 'killed', 'fighting'), ('Insurgents', 'in', 'ongoing'), ('Insurgents', 'in', 'fighting'), ('Insurgents', 'ongoing', 'fighting'), ('killed', 'in', 'ongoing'), ('killed', 'in', 'fighting'), ('killed', 'ongoing', 'fighting'), ('in', 'ongoing', 'fighting')]
Parameters
sequence (sequence or iter): the source data to be converted into skipgrams
n (int): the degree of the ngrams
k (int): the skip distance
**kwargs: Undocumented
Returns
iter(tuple): Undocumented
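One way to picture skipgrams: fix each token as the head, look at the next n + k - 1 tokens, and pick every ordered combination of n - 1 of them. The sketch below follows that idea and reproduces the example outputs above. The name skipgrams_sketch is hypothetical, and the input is loaded into a list for simplicity.

```python
from itertools import combinations


def skipgrams_sketch(sequence, n, k):
    """Yield n-item skipgrams that skip at most k tokens in total (hypothetical helper)."""
    tokens = list(sequence)
    for i, head in enumerate(tokens):
        # The remaining n - 1 items come from the next n + k - 1 tokens.
        window = tokens[i + 1:i + n + k]
        for tail in combinations(window, n - 1):
            yield (head,) + tail


sent = "Insurgents killed in ongoing fighting".split()
pairs = list(skipgrams_sketch(sent, 2, 2))
triples = list(skipgrams_sketch(sent, 3, 2))
print(len(pairs), len(triples))  # 9 10
print(pairs[:3])
# [('Insurgents', 'killed'), ('Insurgents', 'in'), ('Insurgents', 'ongoing')]
```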
def tokenwrap(tokens, separator=' ', width=70): (source)

Pretty print a list of text tokens, breaking lines on whitespace

Parameters
tokens (list): the tokens to print
separator (str): the string to use to separate tokens
width (int): the display width (default=70)
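Breaking a token list on whitespace at a given width can be sketched with the standard textwrap module. The name tokenwrap_sketch is hypothetical, and joining the tokens before wrapping is an assumption about how the wrapping is done.

```python
import textwrap


def tokenwrap_sketch(tokens, separator=" ", width=70):
    """Join tokens with `separator` and wrap the result at `width` columns.

    Hypothetical helper for illustration.
    """
    return "\n".join(textwrap.wrap(separator.join(tokens), width=width))


wrapped = tokenwrap_sketch("the quick brown fox".split(), width=10)
print(wrapped)
# the quick
# brown fox
```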
def transitive_closure(graph, reflexive=False): (source)

Calculate the transitive closure of a directed graph, optionally the reflexive transitive closure.

The algorithm is a slight modification of the "Marking Algorithm" of Ioannidis & Ramakrishnan (1998) "Efficient Transitive Closure Algorithms".

Parameters
graph (dict(set)): the initial graph, represented as a dictionary of sets
reflexive (bool): if set, also make the closure reflexive
Returns
dict(set): Undocumented
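To make the notion of a (reflexive) transitive closure concrete, here is a naive fixed-point sketch. It is deliberately not the Marking Algorithm mentioned above, just a simple illustration. The name transitive_closure_sketch is hypothetical, and giving every target node its own key in the result is an assumption.

```python
def transitive_closure_sketch(graph, reflexive=False):
    """Naive fixed-point transitive closure of {node: set_of_targets}.

    Hypothetical helper; simpler and slower than the Marking Algorithm.
    """
    closure = {node: set(targets) for node, targets in graph.items()}
    # Assumption: nodes that appear only as targets also get a key.
    for targets in graph.values():
        for target in targets:
            closure.setdefault(target, set())
    if reflexive:
        for node in closure:
            closure[node].add(node)
    changed = True
    while changed:
        changed = False
        for reachable in closure.values():
            # Everything reachable from a reachable node is reachable too.
            extra = set()
            for mid in reachable:
                extra |= closure[mid]
            if not extra <= reachable:
                reachable |= extra
                changed = True
    return closure


graph = {"a": {"b"}, "b": {"c"}, "c": set()}
plain = transitive_closure_sketch(graph)
with_self = transitive_closure_sketch(graph, reflexive=True)
print(plain == {"a": {"b", "c"}, "b": {"c"}, "c": set()})  # True
print(with_self["a"] == {"a", "b", "c"})                   # True
```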
def trigrams(sequence, **kwargs): (source)

Return the trigrams generated from a sequence of items, as an iterator. For example:

>>> from nltk.util import trigrams
>>> list(trigrams([1,2,3,4,5]))
[(1, 2, 3), (2, 3, 4), (3, 4, 5)]

Wrap with list for a list version of this function.

Parameters
sequence (sequence or iter): the source data to be converted into trigrams
**kwargs: Undocumented
Returns
iter(tuple): Undocumented
def unique_list(xs): (source)

Undocumented

def unweighted_minimum_spanning_tree(tree, children=iter): (source)

Output a Minimum Spanning Tree (MST) of an unweighted graph, by traversing the nodes of a tree in breadth-first order, discarding any cycles.

The first argument should be the tree root; children should be a function taking as argument a tree node and returning an iterator of the node's children.

>>> import nltk
>>> from nltk.util import unweighted_minimum_spanning_tree as mst
>>> wn=nltk.corpus.wordnet
>>> from pprint import pprint
>>> pprint(mst(wn.synset('bound.a.01'), lambda s:s.also_sees()))
[Synset('bound.a.01'),
 [Synset('unfree.a.02'),
  [Synset('confined.a.02')],
  [Synset('dependent.a.01')],
  [Synset('restricted.a.01'), [Synset('classified.a.02')]]]]
def usage(obj, selfname='self'): (source)

Undocumented