nltk.util

module documentation

Undocumented

Class	`Index`	Undocumented
Function	`acyclic_branches_depth_first`	Traverse the nodes of a tree in depth-first order, discarding any cycles within the same branch, but keeping duplicate paths in different branches. Add cut_mark (when defined) if cycles were truncated.
Function	`acyclic_breadth_first`	Traverse the nodes of a tree in breadth-first order, discarding any cycles.
Function	`acyclic_depth_first`	Traverse the nodes of a tree in depth-first order, discarding any cycles within any branch, adding cut_mark (when specified) if cycles were truncated.
Function	`acyclic_dic2tree`	Convert acyclic dictionary 'dic', where the keys are nodes, and the values are lists of children, to output tree suitable for pprint(), starting at root 'node', with subtrees as nested lists.
Function	`bigrams`	Return the bigrams generated from a sequence of items, as an iterator.
Function	`binary_search_file`	Return the line from the file with first word key. Searches through a sorted file using the binary search algorithm.
Function	`breadth_first`	Traverse the nodes of a tree in breadth-first order. (No check for cycles.) The first argument should be the tree root; children should be a function taking as argument a tree node and returning an iterator of the node's children.
Function	`choose`	This function is a fast way to calculate binomial coefficients, commonly known as nCk, i.e. the number of combinations of n things taken k at a time. (https://en.wikipedia.org/wiki/Binomial_coefficient...
Function	`clean_html`	Undocumented
Function	`clean_url`	Undocumented
Function	`elementtree_indent`	Recursive function to indent an ElementTree._ElementInterface used for pretty printing. Run indent on elem and then output in the normal way.
Function	`everygrams`	Returns all possible ngrams generated from a sequence of items, as an iterator.
Function	`filestring`	Undocumented
Function	`flatten`	Flatten a list.
Function	`guess_encoding`	Given a byte string, attempt to decode it. Tries the standard 'UTF8' and 'latin-1' encodings, plus several gathered from locale information.
Function	`in_idle`	Return True if this function is run within IDLE. Tkinter programs that are run in IDLE should never call `Tk.mainloop`; so this function should be used to gate all calls to `Tk.mainloop`.
Function	`invert_dict`	Undocumented
Function	`invert_graph`	Inverts a directed graph.
Function	`ngrams`	Return the ngrams generated from a sequence of items, as an iterator.
Function	`pad_sequence`	Returns a padded sequence of items before ngram extraction.
Function	`pairwise`	s -> (s0,s1), (s1,s2), (s2, s3), ...
Function	`parallelize_preprocess`	Undocumented
Function	`pr`	Pretty print a sequence of data items
Function	`print_string`	Pretty print a string, breaking lines on whitespace
Function	`py25`	Undocumented
Function	`re_show`	Return a string with markers surrounding the matched substrings. Search `string` for substrings matching `regexp` and wrap the matches with braces. This is convenient for learning about regular expressions.
Function	`set_proxy`	Set the HTTP proxy for Python to download through.
Function	`skipgrams`	Returns all possible skipgrams generated from a sequence of items, as an iterator. Skipgrams are ngrams that allow tokens to be skipped. Refer to http://homepages.inf.ed.ac.uk/ballison/pdf/lrec_skipgrams.pdf...
Function	`tokenwrap`	Pretty print a list of text tokens, breaking lines on whitespace
Function	`transitive_closure`	Calculate the transitive closure of a directed graph, optionally the reflexive transitive closure.
Function	`trigrams`	Return the trigrams generated from a sequence of items, as an iterator.
Function	`unique_list`	Undocumented
Function	`unweighted_minimum_spanning_tree`	Output a Minimum Spanning Tree (MST) of an unweighted graph, by traversing the nodes of a tree in breadth-first order, discarding any cycles.
Function	`usage`	Undocumented

def acyclic_branches_depth_first(tree, children=iter, depth=-1, cut_mark=None, traversed=None): (source) ¶

Traverse the nodes of a tree in depth-first order, discarding any cycles within the same branch, but keeping duplicate paths in different branches. Add cut_mark (when defined) if cycles were truncated.

The first argument should be the tree root; children should be a function taking as argument a tree node and returning an iterator of the node's children.

Catches only cycles within the same branch, while keeping cycles from different branches:

>>> import nltk
>>> from nltk.util import acyclic_branches_depth_first as tree
>>> wn=nltk.corpus.wordnet
>>> from pprint import pprint
>>> pprint(tree(wn.synset('certified.a.01'), lambda s:s.also_sees(), cut_mark='...', depth=4))
[Synset('certified.a.01'),
 [Synset('authorized.a.01'),
  [Synset('lawful.a.01'),
   [Synset('legal.a.01'),
    "Cycle(Synset('lawful.a.01'),0,...)",
    [Synset('legitimate.a.01'), '...']],
   [Synset('straight.a.06'),
    [Synset('honest.a.01'), '...'],
    "Cycle(Synset('lawful.a.01'),0,...)"]],
  [Synset('legitimate.a.01'),
   "Cycle(Synset('authorized.a.01'),1,...)",
   [Synset('legal.a.01'),
    [Synset('lawful.a.01'), '...'],
    "Cycle(Synset('legitimate.a.01'),0,...)"],
   [Synset('valid.a.01'),
    "Cycle(Synset('legitimate.a.01'),0,...)",
    [Synset('reasonable.a.01'), '...']]],
  [Synset('official.a.01'), "Cycle(Synset('authorized.a.01'),1,...)"]],
 [Synset('documented.a.01')]]

def acyclic_breadth_first(tree, children=iter, maxdepth=-1): (source) ¶

Traverse the nodes of a tree in breadth-first order, discarding any cycles.

The first argument should be the tree root; children should be a function taking as argument a tree node and returning an iterator of the node's children.
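The idea can be illustrated with a minimal stand-alone sketch. This is not NLTK's implementation: the helper name `bfs_no_cycles` and the toy graph are hypothetical, and the assumption that `maxdepth=-1` means "no depth limit" is inferred from the default value.

```python
from collections import deque

def bfs_no_cycles(root, children=iter, maxdepth=-1):
    """Yield nodes breadth-first, skipping any node that was already visited."""
    seen = set()
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        if node in seen:
            continue  # already yielded: this edge closes a cycle or repeats a node
        seen.add(node)
        yield node
        if depth != maxdepth:  # assumption: -1 never matches, so no limit
            for child in children(node):
                if child not in seen:
                    queue.append((child, depth + 1))

# Hypothetical graph with a cycle a -> b -> a and two paths to d.
graph = {"a": ["b", "c"], "b": ["a", "d"], "c": ["d"], "d": []}
print(list(bfs_no_cycles("a", lambda n: graph[n])))
```

Each node is yielded once even though `d` is reachable along two paths and `b` points back to `a`.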

def acyclic_depth_first(tree, children=iter, depth=-1, cut_mark=None, traversed=None): (source) ¶

Traverse the nodes of a tree in depth-first order, discarding any cycles within any branch, adding cut_mark (when specified) if cycles were truncated.

The first argument should be the tree root; children should be a function taking as argument a tree node and returning an iterator of the node's children.

Catches all cycles:

>>> import nltk
>>> from nltk.util import acyclic_depth_first as acyclic_tree
>>> wn=nltk.corpus.wordnet
>>> from pprint import pprint
>>> pprint(acyclic_tree(wn.synset('dog.n.01'), lambda s:s.hypernyms(),cut_mark='...'))
[Synset('dog.n.01'),
 [Synset('canine.n.02'),
  [Synset('carnivore.n.01'),
   [Synset('placental.n.01'),
    [Synset('mammal.n.01'),
     [Synset('vertebrate.n.01'),
      [Synset('chordate.n.01'),
       [Synset('animal.n.01'),
        [Synset('organism.n.01'),
         [Synset('living_thing.n.01'),
          [Synset('whole.n.02'),
           [Synset('object.n.01'),
            [Synset('physical_entity.n.01'),
             [Synset('entity.n.01')]]]]]]]]]]]]],
 [Synset('domestic_animal.n.01'), "Cycle(Synset('animal.n.01'),-3,...)"]]

def acyclic_dic2tree(node, dic): (source) ¶

Convert acyclic dictionary 'dic', where the keys are nodes, and the values are lists of children, to output tree suitable for pprint(), starting at root 'node', with subtrees as nested lists.
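A minimal sketch of the conversion described above, assuming every node (including leaves) appears as a key in the dictionary. The helper name `dic2tree` and the sample data are hypothetical.

```python
from pprint import pprint

def dic2tree(node, dic):
    """Return [node, subtree, subtree, ...], building each subtree recursively."""
    return [node] + [dic2tree(child, dic) for child in dic[node]]

# Hypothetical acyclic dictionary: keys are nodes, values are lists of children.
dic = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
tree = dic2tree("a", dic)
pprint(tree)
```

Because the recursion has no cycle check, the input must really be acyclic; a cyclic dictionary would recurse until Python's recursion limit is hit.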

def bigrams(sequence, **kwargs): (source) ¶

Return the bigrams generated from a sequence of items, as an iterator. For example:

>>> from nltk.util import bigrams
>>> list(bigrams([1,2,3,4,5]))
[(1, 2), (2, 3), (3, 4), (4, 5)]

Wrap with list for a list version of this function.

Parameters
sequence:sequence or iter	the source data to be converted into bigrams
**kwargs	Undocumented
Returns
iter(tuple)	Undocumented

def binary_search_file(file, key, cache={}, cacheDepth=-1): (source) ¶

Return the line from the file with first word key. Searches through a sorted file using the binary search algorithm.

Parameters
file:file	the file to be searched through.
key:str	the identifier we are searching for.
cache	Undocumented
cacheDepth	Undocumented
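The search strategy can be sketched on byte offsets as follows. This is an illustrative stand-alone version, not NLTK's code: it ignores the `cache` and `cacheDepth` parameters, assumes the file is opened in binary mode with lines sorted bytewise by their first word, and the helper names `first_word`, `read_line_after` and `find_line` are hypothetical.

```python
import io

def first_word(line):
    """Return the first whitespace-delimited word of a bytes line."""
    parts = line.split(None, 1)
    return parts[0] if parts else b""

def read_line_after(file, offset):
    """Return the first complete line starting after offset (or at offset 0)."""
    file.seek(offset)
    if offset > 0:
        file.readline()  # discard the line we landed in
    return file.readline()

def find_line(file, key):
    """Binary-search a sorted binary file for the line whose first word is key."""
    key = key.encode("utf-8")
    file.seek(0, io.SEEK_END)
    lo, hi = 0, file.tell()
    # Find the smallest offset whose following line has a first word >= key.
    while lo < hi:
        mid = (lo + hi) // 2
        line = read_line_after(file, mid)
        if not line or first_word(line) >= key:
            hi = mid
        else:
            lo = mid + 1
    line = read_line_after(file, lo)
    if line and first_word(line) == key:
        return line.decode("utf-8")
    return None

# Hypothetical sorted data, held in memory for the demonstration.
data = io.BytesIO(b"apple 1\nbanana 2\ncherry 3\ndate 4\n")
print(find_line(data, "cherry"))
```

Only a logarithmic number of seeks is needed, which is the point of searching a sorted file this way instead of scanning it line by line.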

def breadth_first(tree, children=iter, maxdepth=-1): (source) ¶

Traverse the nodes of a tree in breadth-first order. (No check for cycles.) The first argument should be the tree root; children should be a function taking as argument a tree node and returning an iterator of the node's children.
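A minimal stand-alone sketch of this traversal, showing what the missing cycle check implies. The helper name `bfs` and both toy graphs are hypothetical; this is not NLTK's implementation.

```python
from collections import deque
from itertools import islice

def bfs(root, children=iter, maxdepth=-1):
    """Yield nodes breadth-first; nodes reachable twice are yielded twice."""
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        yield node
        if depth != maxdepth:
            queue.extend((child, depth + 1) for child in children(node))

# A proper tree terminates normally.
tree = {"root": ["left", "right"], "left": ["leaf"], "right": [], "leaf": []}
print(list(bfs("root", lambda n: tree[n])))

# A cyclic graph never terminates, so only a slice of the output is taken.
loop = {"a": ["b"], "b": ["a"]}
print(list(islice(bfs("a", lambda n: loop[n]), 5)))
```

On input that may contain cycles, either bound the traversal with a depth limit or use one of the acyclic variants documented on this page.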

def choose(n, k): (source) ¶

This function is a fast way to calculate binomial coefficients, commonly known as nCk, i.e. the number of combinations of n things taken k at a time. (https://en.wikipedia.org/wiki/Binomial_coefficient).

This corresponds to scipy.special.comb() with long integer computation, but this implementation is faster; see https://github.com/nltk/nltk/issues/1181

>>> choose(4, 2)
6
>>> choose(6, 2)
15

Parameters
n:int	The number of things.
k:int	The number of times a thing is taken.
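One common way to compute nCk with integer arithmetic only is the multiplicative formula, sketched below. The helper name `n_choose_k` is hypothetical and this is not necessarily how NLTK computes it; `math.comb` (Python 3.8+) is used only as a cross-check.

```python
from math import comb

def n_choose_k(n, k):
    """Binomial coefficient via the multiplicative formula, exact for ints."""
    if k < 0 or k > n:
        return 0
    k = min(k, n - k)  # symmetry: C(n, k) == C(n, n - k)
    result = 1
    for i in range(1, k + 1):
        # After this step, result == C(n - k + i, i), which is always an integer.
        result = result * (n - k + i) // i
    return result

print(n_choose_k(4, 2), n_choose_k(6, 2))
print(n_choose_k(52, 5) == comb(52, 5))
```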

def clean_html(html): (source) ¶

Undocumented

def clean_url(url): (source) ¶

Undocumented

def elementtree_indent(elem, level=0): (source) ¶

Recursive function to indent an ElementTree._ElementInterface used for pretty printing. Run indent on elem and then output in the normal way.

Parameters
elem:ElementTree._ElementInterface	the element to be indented; it will be modified.
level:nonnegative integer	level of indentation for this element
Returns
ElementTree._ElementInterface	Contents of elem indented to reflect its structure

def everygrams(sequence, min_len=1, max_len=-1, pad_left=False, pad_right=False, **kwargs): (source) ¶

Returns all possible ngrams generated from a sequence of items, as an iterator.

>>> sent = 'a b c'.split()

Current versions yield the everygrams grouped by starting position:

>>> list(everygrams(sent))
[('a',), ('a', 'b'), ('a', 'b', 'c'), ('b',), ('b', 'c'), ('c',)]

The ordering produced by older versions (shortest ngrams first) can be reproduced by sorting on length:

>>> sorted(everygrams(sent), key=len)
[('a',), ('b',), ('c',), ('a', 'b'), ('b', 'c'), ('a', 'b', 'c')]

>>> list(everygrams(sent, max_len=2))
[('a',), ('a', 'b'), ('b',), ('b', 'c'), ('c',)]

Parameters
sequence:sequence or iter	the source data to be converted into ngrams. If max_len is not provided, this sequence will be loaded into memory
min_len:int	minimum length of the ngrams, aka. n-gram order/degree of ngram
max_len:int	maximum length of the ngrams (set to length of sequence by default)
pad_left:bool	whether the ngrams should be left-padded
pad_right:bool	whether the ngrams should be right-padded
**kwargs	Undocumented
Returns
iter(tuple)	Undocumented
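The output order shown in the examples can be reproduced with a short stand-alone sketch. It is illustrative only: the helper name `all_ngrams` is hypothetical, padding is left out, and the sequence is always loaded into memory.

```python
def all_ngrams(tokens, min_len=1, max_len=-1):
    """Yield every ngram of length min_len..max_len, grouped by start position."""
    tokens = list(tokens)
    if max_len == -1:
        max_len = len(tokens)  # default: up to the full sequence length
    for start in range(len(tokens)):
        for n in range(min_len, max_len + 1):
            if start + n <= len(tokens):
                yield tuple(tokens[start:start + n])

sent = "a b c".split()
print(list(all_ngrams(sent)))
print(list(all_ngrams(sent, max_len=2)))
```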

def filestring(f): (source) ¶

Undocumented

def flatten(*args): (source) ¶

Flatten a list.

>>> from nltk.util import flatten
>>> flatten(1, 2, ['b', 'a' , ['c', 'd']], 3)
[1, 2, 'b', 'a', 'c', 'd', 3]

Parameters
*args	items and lists to be combined into a single list
Returns
list	Undocumented
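A recursive sketch of the flattening behaviour shown in the example. The helper name `flatten_args` is hypothetical, and treating tuples like lists is an assumption of this sketch.

```python
def flatten_args(*args):
    """Combine items and arbitrarily nested lists into one flat list."""
    out = []
    for item in args:
        if isinstance(item, (list, tuple)):  # assumption: tuples are flattened too
            out.extend(flatten_args(*item))
        else:
            out.append(item)
    return out

print(flatten_args(1, 2, ["b", "a", ["c", "d"]], 3))
```

Strings are deliberately not treated as sequences here, so `'b'` stays a single item rather than being split into characters.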

def guess_encoding(data): (source) ¶

Given a byte string, attempt to decode it. Tries the standard 'UTF8' and 'latin-1' encodings, plus several gathered from locale information.

The calling program must first call:

locale.setlocale(locale.LC_ALL, '')

If successful it returns (decoded_unicode, successful_encoding). If unsuccessful it raises a UnicodeError.
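The try-each-encoding approach can be sketched as follows. This is a simplified illustration, not NLTK's code: the helper name `try_decodings`, the candidate order, and the use of `locale.getpreferredencoding` as the only locale-derived candidate are assumptions.

```python
import locale

def try_decodings(data, encodings=("utf-8",)):
    """Return (text, encoding) for the first candidate that decodes data."""
    candidates = list(encodings)
    preferred = locale.getpreferredencoding(False)  # locale-derived guess
    if preferred:
        candidates.append(preferred)
    candidates.append("latin-1")  # maps every byte, so it works as a last resort
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except (UnicodeError, LookupError):
            continue
    raise UnicodeError("none of the candidate encodings could decode the data")

text, used = try_decodings("café".encode("utf-8"))
print(text, used)
```

Because 'latin-1' can decode any byte string, a successful result only means the bytes were decodable, not that the guess matches the encoding the data was written in.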

def in_idle(): (source) ¶

Return True if this function is run within IDLE. Tkinter programs that are run in IDLE should never call Tk.mainloop; so this function should be used to gate all calls to Tk.mainloop.

Returns
bool	Undocumented
Warning
This function works by checking `sys.stdin`. If the user has modified `sys.stdin`, then it may return incorrect results.

def invert_dict(d): (source) ¶

Undocumented

def invert_graph(graph): (source) ¶

Inverts a directed graph.

Parameters
graph:dict(set)	the graph, represented as a dictionary of sets
Returns
dict(set)	the inverted graph
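A minimal sketch of inverting a graph stored as a dictionary of sets: every edge `src -> dst` becomes `dst -> src`. The helper name `invert` and the sample graph are hypothetical.

```python
def invert(graph):
    """Reverse every edge of a directed graph given as {node: set_of_targets}."""
    inverted = {}
    for src, targets in graph.items():
        for dst in targets:
            inverted.setdefault(dst, set()).add(src)
    return inverted

graph = {"a": {"b", "c"}, "b": {"c"}}
print(invert(graph))
```

In this sketch a node with no incoming edges (here `a`) does not appear as a key in the result.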

def ngrams(sequence, n, **kwargs): (source) ¶

Return the ngrams generated from a sequence of items, as an iterator. For example:

>>> from nltk.util import ngrams
>>> list(ngrams([1,2,3,4,5], 3))
[(1, 2, 3), (2, 3, 4), (3, 4, 5)]

Wrap with list for a list version of this function. Set pad_left or pad_right to True in order to get additional ngrams:

>>> list(ngrams([1,2,3,4,5], 2, pad_right=True))
[(1, 2), (2, 3), (3, 4), (4, 5), (5, None)]
>>> list(ngrams([1,2,3,4,5], 2, pad_right=True, right_pad_symbol='</s>'))
[(1, 2), (2, 3), (3, 4), (4, 5), (5, '</s>')]
>>> list(ngrams([1,2,3,4,5], 2, pad_left=True, left_pad_symbol='<s>'))
[('<s>', 1), (1, 2), (2, 3), (3, 4), (4, 5)]
>>> list(ngrams([1,2,3,4,5], 2, pad_left=True, pad_right=True, left_pad_symbol='<s>', right_pad_symbol='</s>'))
[('<s>', 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, '</s>')]

Parameters
sequence:sequence or iter	the source data to be converted into ngrams
n:int	the degree of the ngrams
pad_left:bool	whether the ngrams should be left-padded
pad_right:bool	whether the ngrams should be right-padded
left_pad_symbol:any	the symbol to use for left padding (default is None)
right_pad_symbol:any	the symbol to use for right padding (default is None)
**kwargs	Undocumented
Returns
sequence or iter	Undocumented

def pad_sequence(sequence, n, pad_left=False, pad_right=False, left_pad_symbol=None, right_pad_symbol=None): (source) ¶

Returns a padded sequence of items before ngram extraction.

>>> list(pad_sequence([1,2,3,4,5], 2, pad_left=True, pad_right=True, left_pad_symbol='<s>', right_pad_symbol='</s>'))
['<s>', 1, 2, 3, 4, 5, '</s>']
>>> list(pad_sequence([1,2,3,4,5], 2, pad_left=True, left_pad_symbol='<s>'))
['<s>', 1, 2, 3, 4, 5]
>>> list(pad_sequence([1,2,3,4,5], 2, pad_right=True, right_pad_symbol='</s>'))
[1, 2, 3, 4, 5, '</s>']

Parameters
sequence:sequence or iter	the source data to be padded
n:int	the degree of the ngrams
pad_left:bool	whether the ngrams should be left-padded
pad_right:bool	whether the ngrams should be right-padded
left_pad_symbol:any	the symbol to use for left padding (default is None)
right_pad_symbol:any	the symbol to use for right padding (default is None)
Returns
sequence or iter	Undocumented
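How padding and ngram extraction fit together can be sketched with a sliding window. The helper names `pad` and `sliding_ngrams` are hypothetical and this is not NLTK's code; the use of `n - 1` pad symbols per side is inferred from the documented examples, where `n=2` adds one symbol.

```python
from collections import deque
from itertools import chain

def pad(sequence, n, pad_left=False, pad_right=False,
        left_pad_symbol=None, right_pad_symbol=None):
    """Lazily add n - 1 pad symbols on the requested sides."""
    sequence = iter(sequence)
    if pad_left:
        sequence = chain((left_pad_symbol,) * (n - 1), sequence)
    if pad_right:
        sequence = chain(sequence, (right_pad_symbol,) * (n - 1))
    return sequence

def sliding_ngrams(sequence, n, **pad_kwargs):
    """Yield ngrams by sliding a window of size n over the padded sequence."""
    window = deque(maxlen=n)  # the oldest item drops out automatically
    for item in pad(sequence, n, **pad_kwargs):
        window.append(item)
        if len(window) == n:
            yield tuple(window)

print(list(sliding_ngrams([1, 2, 3, 4, 5], 2, pad_right=True, right_pad_symbol="</s>")))
```

Padding first and windowing second keeps both steps lazy, so the input can be any iterator.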

def pairwise(iterable): (source) ¶

s -> (s0,s1), (s1,s2), (s2, s3), ...
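The pattern above is the classic `itertools` recipe; a minimal sketch under the hypothetical name `pairs`:

```python
from itertools import tee

def pairs(iterable):
    """s -> (s0, s1), (s1, s2), (s2, s3), ..."""
    a, b = tee(iterable)  # two independent iterators over the same items
    next(b, None)         # advance the second one by a single item
    return zip(a, b)

print(list(pairs([1, 2, 3, 4])))
```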

def parallelize_preprocess(func, iterator, processes, progress_bar=False): (source) ¶

Undocumented

def pr(data, start=0, end=None): (source) ¶

Pretty print a sequence of data items

Parameters
data:sequence or iter	the data stream to print
start:int	the start position
end:int	the end position

def print_string(s, width=70): (source) ¶

Pretty print a string, breaking lines on whitespace

Parameters
s:str	the string to print, consisting of words and spaces
width:int	the display width

def py25(): (source) ¶

Undocumented

def re_show(regexp, string, left='{', right='}'): (source) ¶

Return a string with markers surrounding the matched substrings. Search string for substrings matching regexp and wrap the matches with braces. This is convenient for learning about regular expressions.

Parameters
regexp:str	The regular expression.
string:str	The string being matched.
left:str	The left delimiter (printed before the matched substring)
right:str	The right delimiter (printed after the matched substring)
Returns
str	Undocumented
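The marking behaviour can be sketched with `re.sub`. The helper name `mark_matches` is hypothetical, and compiling with `re.M` is an assumption of this sketch.

```python
import re

def mark_matches(regexp, string, left="{", right="}"):
    """Return string with every match of regexp wrapped in left/right markers."""
    pattern = re.compile(regexp, re.M)
    # A function replacement avoids treating backslashes in the markers specially.
    return pattern.sub(lambda m: left + m.group() + right, string)

print(mark_matches(r"[aeiou]+", "regular expression"))
```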

def set_proxy(proxy, user=None, password=''): (source) ¶

Set the HTTP proxy for Python to download through.

If proxy is None then tries to set proxy from environment or system settings.

Parameters
proxy	The HTTP proxy server to use. For example: 'http://proxy.example.com:3128/'
user	The username to authenticate with. Use None to disable authentication.
password	The password to authenticate with.
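With the standard library alone, an HTTP proxy with optional authentication can be configured roughly as sketched below. The helper name `build_proxy_opener` is hypothetical and this is not NLTK's code; it covers only the case of an explicit proxy URL and does not read the environment or system settings.

```python
from urllib import request

def build_proxy_opener(proxy, user=None, password=""):
    """Build a urllib opener that routes HTTP(S) requests through proxy."""
    proxy_handler = request.ProxyHandler({"http": proxy, "https": proxy})
    opener = request.build_opener(proxy_handler)
    if user is not None:
        # Register the credentials for the proxy with both auth schemes.
        manager = request.HTTPPasswordMgrWithDefaultRealm()
        manager.add_password(realm=None, uri=proxy, user=user, passwd=password)
        opener.add_handler(request.ProxyBasicAuthHandler(manager))
        opener.add_handler(request.ProxyDigestAuthHandler(manager))
    return opener

opener = build_proxy_opener("http://proxy.example.com:3128/")
# request.install_opener(opener) would make this opener the process-wide default.
```

Nothing is contacted when the opener is built; the proxy is only used once a URL is opened through it.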

def skipgrams(sequence, n, k, **kwargs): (source) ¶

Returns all possible skipgrams generated from a sequence of items, as an iterator. Skipgrams are ngrams that allow tokens to be skipped. Refer to http://homepages.inf.ed.ac.uk/ballison/pdf/lrec_skipgrams.pdf

>>> sent = "Insurgents killed in ongoing fighting".split()
>>> list(skipgrams(sent, 2, 2))
[('Insurgents', 'killed'), ('Insurgents', 'in'), ('Insurgents', 'ongoing'), ('killed', 'in'), ('killed', 'ongoing'), ('killed', 'fighting'), ('in', 'ongoing'), ('in', 'fighting'), ('ongoing', 'fighting')]
>>> list(skipgrams(sent, 3, 2))
[('Insurgents', 'killed', 'in'), ('Insurgents', 'killed', 'ongoing'), ('Insurgents', 'killed', 'fighting'), ('Insurgents', 'in', 'ongoing'), ('Insurgents', 'in', 'fighting'), ('Insurgents', 'ongoing', 'fighting'), ('killed', 'in', 'ongoing'), ('killed', 'in', 'fighting'), ('killed', 'ongoing', 'fighting'), ('in', 'ongoing', 'fighting')]

Parameters
sequence:sequence or iter	the source data to be converted into skipgrams
n:int	the degree of the ngrams
k:int	the skip distance
**kwargs	Undocumented
Returns
iter(tuple)	Undocumented
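The examples above are consistent with the following stand-alone sketch: for each starting token, pick the remaining `n - 1` tokens, in order, from the next `n + k - 1` tokens. The helper name `skip_ngrams` is hypothetical and this is not NLTK's implementation.

```python
from itertools import combinations

def skip_ngrams(tokens, n, k):
    """Yield n-token skipgrams that skip at most k tokens in total."""
    tokens = list(tokens)
    for i, head in enumerate(tokens):
        window = tokens[i + 1:i + n + k]  # the next n + k - 1 tokens
        for tail in combinations(window, n - 1):
            yield (head,) + tail

sent = "Insurgents killed in ongoing fighting".split()
print(list(skip_ngrams(sent, 2, 2)))
```

`itertools.combinations` preserves the order of its input, so every skipgram keeps the tokens in sentence order.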

def tokenwrap(tokens, separator=' ', width=70): (source) ¶

Pretty print a list of text tokens, breaking lines on whitespace

Parameters
tokens:list	the tokens to print
separator:str	the string to use to separate tokens
width:int	the display width (default=70)
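A minimal sketch of the wrapping behaviour using the standard `textwrap` module. The helper name `wrap_tokens` is hypothetical, and returning the wrapped text as a single string is an assumption of this sketch.

```python
import textwrap

def wrap_tokens(tokens, separator=" ", width=70):
    """Join tokens with separator and break the result into lines of at most width."""
    return "\n".join(textwrap.wrap(separator.join(tokens), width=width))

print(wrap_tokens("the quick brown fox".split(), width=10))
```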

def transitive_closure(graph, reflexive=False): (source) ¶

Calculate the transitive closure of a directed graph, optionally the reflexive transitive closure.

The algorithm is a slight modification of the "Marking Algorithm" of Ioannidis & Ramakrishnan (1998) "Efficient Transitive Closure Algorithms".

Parameters
graph:dict(set)	the initial graph, represented as a dictionary of sets
reflexive:bool	if set, also make the closure reflexive
Returns
dict(set)	Undocumented
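For illustration, the closure can also be computed with a plain depth-first search from every node. Note that this sketch is not the Marking Algorithm mentioned above; it swaps in a simpler technique, and the helper name `closure` is hypothetical.

```python
def closure(graph, reflexive=False):
    """Map each node to the set of nodes reachable from it."""
    result = {}
    for start in graph:
        reachable = set()
        stack = list(graph[start])
        while stack:
            node = stack.pop()
            if node not in reachable:
                reachable.add(node)
                stack.extend(graph.get(node, ()))  # nodes without a key have no edges
        if reflexive:
            reachable.add(start)
        result[start] = reachable
    return result

graph = {"a": {"b"}, "b": {"c"}, "c": set()}
print(closure(graph))
print(closure(graph, reflexive=True))
```

Running one search per node is simple to verify but does more work than necessary on large graphs.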

def trigrams(sequence, **kwargs): (source) ¶

Return the trigrams generated from a sequence of items, as an iterator. For example:

>>> from nltk.util import trigrams
>>> list(trigrams([1,2,3,4,5]))
[(1, 2, 3), (2, 3, 4), (3, 4, 5)]

Wrap with list for a list version of this function.

Parameters
sequence:sequence or iter	the source data to be converted into trigrams
**kwargs	Undocumented
Returns
iter(tuple)	Undocumented

def unique_list(xs): (source) ¶

Undocumented

def unweighted_minimum_spanning_tree(tree, children=iter): (source) ¶

Output a Minimum Spanning Tree (MST) of an unweighted graph, by traversing the nodes of a tree in breadth-first order, discarding any cycles.

The first argument should be the tree root; children should be a function taking as argument a tree node and returning an iterator of the node's children.

>>> import nltk
>>> from nltk.util import unweighted_minimum_spanning_tree as mst
>>> wn=nltk.corpus.wordnet
>>> from pprint import pprint
>>> pprint(mst(wn.synset('bound.a.01'), lambda s:s.also_sees()))
[Synset('bound.a.01'),
 [Synset('unfree.a.02'),
  [Synset('confined.a.02')],
  [Synset('dependent.a.01')],
  [Synset('restricted.a.01'), [Synset('classified.a.02')]]]]

def usage(obj, selfname='self'): (source) ¶

Undocumented