class documentation

class Text(object): (source)

Known subclasses: nltk.text.TextCollection

Constructor: Text(tokens, name)


A wrapper around a sequence of simple (string) tokens, which is intended to support initial exploration of texts (via the interactive console). Its methods perform a variety of analyses on the text's contexts (e.g., counting, concordancing, collocation discovery), and display the results. If you wish to write a program which makes use of these analyses, then you should bypass the Text class, and use the appropriate analysis function or class directly instead.
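For example, the concordance analysis is also available directly through nltk.text.ConcordanceIndex. A minimal sketch with an inline token list (not from the source):

```python
from nltk.text import ConcordanceIndex

# Build the index directly from a token sequence, without a Text wrapper.
tokens = "to be or not to be".split()
index = ConcordanceIndex(tokens)

# offsets() returns the positions at which the word occurs.
print(index.offsets("be"))  # [1, 5]
```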

A Text is typically initialized from a given document or corpus. E.g.:

>>> import nltk.corpus
>>> from nltk.text import Text
>>> moby = Text(nltk.corpus.gutenberg.words('melville-moby_dick.txt'))
Method __getitem__ Undocumented
Method __init__ Create a Text object.
Method __len__ Undocumented
Method __repr__ Undocumented
Method __str__ Undocumented
Method collocation_list Return collocations derived from the text, ignoring stopwords.
Method collocations Print collocations derived from the text, ignoring stopwords.
Method common_contexts Find contexts where the specified words appear; list most frequent common contexts first.
Method concordance Prints a concordance for word with the specified context window. Word matching is not case-sensitive.
Method concordance_list Generate a concordance for word with the specified context window. Word matching is not case-sensitive.
Method count Count the number of times this word appears in the text.
Method dispersion_plot Produce a plot showing the distribution of the words through the text. Requires pylab to be installed.
Method findall Find instances of the regular expression in the text. The text is a list of tokens, and a regexp pattern to match a single token must be surrounded by angle brackets. E.g.
Method generate Print random text, generated using a trigram language model. See also help(nltk.lm).
Method index Find the index of the first occurrence of the word in the text.
Method plot See documentation for FreqDist.plot() :seealso: nltk.prob.FreqDist.plot()
Method readability Undocumented
Method similar Distributional similarity: find other words which appear in the same contexts as the specified word; list most similar words first.
Method vocab No summary
Instance Variable name Undocumented
Instance Variable tokens Undocumented
Method _context One left & one right token, both case-normalized. Skip over non-sentence-final punctuation. Used by the ContextIndex that is created for similar() and common_contexts().
Method _train_default_ngram_lm Undocumented
Constant _CONTEXT_RE Undocumented
Constant _COPY_TOKENS Undocumented
Instance Variable _collocations Undocumented
Instance Variable _concordance_index Undocumented
Instance Variable _num Undocumented
Instance Variable _token_searcher Undocumented
Instance Variable _tokenized_sents Undocumented
Instance Variable _trigram_model Undocumented
Instance Variable _vocab Undocumented
Instance Variable _window_size Undocumented
Instance Variable _word_context_index Undocumented
def __getitem__(self, i): (source)

Undocumented

def __init__(self, tokens, name=None): (source)

Create a Text object.

Parameters
tokens (sequence of str): The source text.
name: Undocumented
def __len__(self): (source)

Undocumented

def __repr__(self): (source)

Undocumented

def __str__(self): (source)

Undocumented

def collocation_list(self, num=20, window_size=2): (source)

Return collocations derived from the text, ignoring stopwords.

>>> from nltk.book import text4
>>> text4.collocation_list()[:2]
[('United', 'States'), ('fellow', 'citizens')]
Parameters
num (int): The maximum number of collocations to return.
window_size (int): The number of tokens spanned by a collocation (default=2).
Returns
list(tuple(str, str)): Undocumented
def collocations(self, num=20, window_size=2): (source)

Print collocations derived from the text, ignoring stopwords.

>>> from nltk.book import text4
>>> text4.collocations() # doctest: +ELLIPSIS
United States; fellow citizens; four years; ...
Parameters
num (int): The maximum number of collocations to print.
window_size (int): The number of tokens spanned by a collocation (default=2).
def common_contexts(self, words, num=20): (source)

Find contexts where the specified words appear; list most frequent common contexts first.

Parameters
words (list(str)): The words used to seed the similarity search.
num (int): The maximum number of common contexts to display (default=20).
See Also
ContextIndex.common_contexts()
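A minimal sketch with a toy token list (not from the source): here "b" and "d" both occur between "a" and "c", and each shared context is printed in the form left_right:

```python
from nltk.text import Text

# Toy text in which "b" and "d" share the context ("a", "c").
text = Text("a b c a d c".split())

# Prints the contexts common to the given words, e.g. "a_c".
text.common_contexts(["b", "d"])
```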
def concordance(self, word, width=79, lines=25): (source)

Prints a concordance for word with the specified context window. Word matching is not case-sensitive.

Parameters
word (str or list): The target word or phrase (a list of strings).
width (int): The width of each line, in characters (default=79).
lines (int): The number of lines to display (default=25).
See Also
ConcordanceIndex
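concordance() prints its results rather than returning them. A minimal sketch with an inline token list (not from the source); the exact header text may vary by NLTK version, but it is typically of the form "Displaying 2 of 2 matches:":

```python
from nltk.text import Text

text = Text("to be or not to be".split())

# Prints each occurrence of "be" centered in a window of context.
text.concordance("be", width=40)
```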
def concordance_list(self, word, width=79, lines=25): (source)

Generate a concordance for word with the specified context window. Word matching is not case-sensitive.

Parameters
word (str or list): The target word or phrase (a list of strings).
width (int): The width of each line, in characters (default=79).
lines (int): The maximum number of lines to return (default=25).
See Also
ConcordanceIndex
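Unlike concordance(), concordance_list() returns the matches instead of printing them. In recent NLTK versions each element is a ConcordanceLine named tuple; the fields used below (query, offset) are assumed from that API:

```python
from nltk.text import Text

text = Text("to be or not to be".split())
lines = text.concordance_list("be")

print(len(lines))        # one entry per match
print(lines[0].query)    # the matched token
print(lines[0].offset)   # its position in the text
```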
def count(self, word): (source)

Count the number of times this word appears in the text.
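For example, with an inline token list (not from the source):

```python
from nltk.text import Text

text = Text("to be or not to be".split())
print(text.count("to"))   # 2
print(text.count("not"))  # 1
```

Note that the count is over exact token strings, so matching is case-sensitive.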

def dispersion_plot(self, words): (source)

Produce a plot showing the distribution of the words through the text. Requires pylab to be installed.

Parameters
words (list(str)): The words to be plotted.
See Also
nltk.draw.dispersion_plot()
def findall(self, regexp): (source)

Find instances of the regular expression in the text. The text is a list of tokens, and a regexp pattern to match a single token must be surrounded by angle brackets. E.g.

>>> print('hack'); from nltk.book import text1, text5, text9
hack...
>>> text5.findall("<.*><.*><bro>")
you rule bro; telling you bro; u twizted bro
>>> text1.findall("<a>(<.*>)<man>")
monied; nervous; dangerous; white; white; white; pious; queer; good;
mature; white; Cape; great; wise; wise; butterless; white; fiendish;
pale; furious; better; certain; complete; dismasted; younger; brave;
brave; brave; brave
>>> text9.findall("<th.*>{3,}")
thread through those; the thought that; that the thing; the thing
that; that that thing; through these than through; them that the;
through the thick; them that they; thought that the
Parameters
regexp (str): A regular expression.
def generate(self, length=100, text_seed=None, random_seed=42): (source)

Print random text, generated using a trigram language model. See also help(nltk.lm).

Parameters
length (int): The length of text to generate (default=100).
text_seed (list(str)): Generation can be conditioned on preceding context.
random_seed (int): A random seed or an instance of random.Random. If provided, makes the random sampling part of generation reproducible (default=42).
def index(self, word): (source)

Find the index of the first occurrence of the word in the text.
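For example, with an inline token list (not from the source):

```python
from nltk.text import Text

text = Text("to be or not to be".split())
print(text.index("not"))  # 3
```

As with list.index(), a word that does not occur in the text raises ValueError.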

def plot(self, *args): (source)

See documentation for FreqDist.plot().

See Also
nltk.prob.FreqDist.plot()

def readability(self, method): (source)

Undocumented

def similar(self, word, num=20): (source)

Distributional similarity: find other words which appear in the same contexts as the specified word; list most similar words first.

Parameters
word (str): The word used to seed the similarity search.
num (int): The number of words to generate (default=20).
See Also
ContextIndex.similar_words()
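A toy sketch (not from the source): in the text below, "b" and "d" both occur in the context ("a", "c"), so similar("b") should list "d":

```python
from nltk.text import Text

text = Text("a b c a d c".split())

# "d" appears in the same (left, right) context as "b", so it is printed.
text.similar("b")
```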
def vocab(self): (source)

See Also
nltk.prob.FreqDist

name = (source)

Undocumented

tokens = (source)

Undocumented
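The vocab() method above returns a FreqDist over the text's tokens, so the usual FreqDist API (indexing, most_common(), N()) applies; a minimal sketch with an inline token list:

```python
from nltk.text import Text

text = Text("to be or not to be".split())
fd = text.vocab()

print(fd["to"])           # frequency of "to" in the text
print(fd.most_common(2))  # the two most frequent tokens with counts
```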

def _context(self, tokens, i): (source)

One left & one right token, both case-normalized. Skip over non-sentence-final punctuation. Used by the ContextIndex that is created for similar() and common_contexts().

def _train_default_ngram_lm(self, tokenized_sents, n=3): (source)

Undocumented

_CONTEXT_RE = (source)

Undocumented

Value
re.compile(r'\w+|[\.!\?]')
_COPY_TOKENS: bool = (source)

Undocumented

Value
True
_collocations = (source)

Undocumented

_concordance_index = (source)

Undocumented

_num = (source)

Undocumented

_token_searcher = (source)

Undocumented

_tokenized_sents = (source)

Undocumented

_trigram_model = (source)

Undocumented

_vocab = (source)

Undocumented

_window_size = (source)

Undocumented

_word_context_index = (source)

Undocumented