class Text(object):
Known subclasses: nltk.text.TextCollection
Constructor: Text(tokens, name)
A wrapper around a sequence of simple (string) tokens, which is intended to support initial exploration of texts (via the interactive console). Its methods perform a variety of analyses on the text's contexts (e.g., counting, concordancing, collocation discovery), and display the results. If you wish to write a program which makes use of these analyses, then you should bypass the Text class, and use the appropriate analysis function or class directly instead.
A Text is typically initialized from a given document or corpus. E.g.:
>>> import nltk.corpus
>>> from nltk.text import Text
>>> moby = Text(nltk.corpus.gutenberg.words('melville-moby_dick.txt'))
Method | __getitem__ |
Undocumented |
Method | __init__ |
Create a Text object. |
Method | __len__ |
Undocumented |
Method | __repr__ |
Undocumented |
Method | __str__ |
Undocumented |
Method | collocation_list |
Return collocations derived from the text, ignoring stopwords. |
Method | collocations |
Print collocations derived from the text, ignoring stopwords. |
Method | common_contexts |
Find contexts where the specified words appear; list most frequent common contexts first. |
Method | concordance |
Prints a concordance for word with the specified context window. Word matching is not case-sensitive. |
Method | concordance_list |
Generate a concordance for word with the specified context window. Word matching is not case-sensitive. |
Method | count |
Count the number of times this word appears in the text. |
Method | dispersion_plot |
Produce a plot showing the distribution of the words through the text. Requires pylab to be installed. |
Method | findall |
Find instances of the regular expression in the text. The text is a list of tokens, and a regexp pattern to match a single token must be surrounded by angle brackets. |
Method | generate |
Print random text, generated using a trigram language model. See also help(nltk.lm). |
Method | index |
Find the index of the first occurrence of the word in the text. |
Method | plot |
See the documentation for FreqDist.plot(). See also: nltk.probability.FreqDist.plot() |
Method | readability |
Undocumented |
Method | similar |
Distributional similarity: find other words which appear in the same contexts as the specified word; list most similar words first. |
Method | vocab |
No summary |
Instance Variable | name |
Undocumented |
Instance Variable | tokens |
Undocumented |
Method | _context |
One left & one right token, both case-normalized. Skip over non-sentence-final punctuation. Used by the ContextIndex that is created for similar() and common_contexts(). |
Method | _train |
Undocumented |
Constant | _CONTEXT |
Undocumented |
Constant | _COPY |
Undocumented |
Instance Variable | _collocations |
Undocumented |
Instance Variable | _concordance |
Undocumented |
Instance Variable | _num |
Undocumented |
Instance Variable | _token |
Undocumented |
Instance Variable | _tokenized |
Undocumented |
Instance Variable | _trigram |
Undocumented |
Instance Variable | _vocab |
Undocumented |
Instance Variable | _window |
Undocumented |
Instance Variable | _word |
Undocumented |
Create a Text object.
Parameters | |
tokens:sequence of str | The source text. |
name | Undocumented |
Return collocations derived from the text, ignoring stopwords.
>>> from nltk.book import text4
>>> text4.collocation_list()[:2]
[('United', 'States'), ('fellow', 'citizens')]
Parameters | |
num:int | The maximum number of collocations to return. |
window_size | The number of tokens spanned by a collocation (default=2) |
Returns | |
list(tuple(str, str)) | Undocumented |
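The idea of stopword-filtered collocations can be illustrated with a minimal, standard-library sketch. It ranks adjacent word pairs by raw frequency, which is a deliberate simplification; NLTK's own scoring may differ. The names `STOPWORDS`, `collocation_candidates`, and `min_freq` are hypothetical and are not part of the NLTK API.

```python
from collections import Counter

# Tiny illustrative stopword list (an assumption, not NLTK's stopword corpus).
STOPWORDS = {"the", "and", "of", "a", "to", "in"}


def collocation_candidates(tokens, num=20, min_freq=2):
    """Return up to `num` frequent adjacent word pairs, ignoring stopwords."""

    def keep(word):
        # Drop very short words and stopwords (case-insensitive).
        return len(word) >= 3 and word.lower() not in STOPWORDS

    counts = Counter(
        (w1, w2)
        for w1, w2 in zip(tokens, tokens[1:])
        if keep(w1) and keep(w2)
    )
    # Keep pairs seen at least `min_freq` times; most frequent first,
    # ties broken alphabetically so the result is deterministic.
    ranked = sorted(
        ((pair, n) for pair, n in counts.items() if n >= min_freq),
        key=lambda item: (-item[1], item[0]),
    )
    return [pair for pair, _ in ranked[:num]]
```

This sketch only considers pairs of adjacent tokens, which corresponds to a window of 2.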
Print collocations derived from the text, ignoring stopwords.
>>> from nltk.book import text4
>>> text4.collocations() # doctest: +ELLIPSIS
United States; fellow citizens; four years; ...
Parameters | |
num:int | The maximum number of collocations to print. |
window_size | The number of tokens spanned by a collocation (default=2) |
Find contexts where the specified words appear; list most frequent common contexts first.
Parameters | |
words:str | The words used to seed the similarity search |
num:int | The number of words to generate (default=20) |
See Also | |
ContextIndex.common_contexts() |
Prints a concordance for word with the specified context window. Word matching is not case-sensitive.
Parameters | |
word:str or list | The target word or phrase (a list of strings) |
width:int | The width of each line, in characters (default=80) |
lines:int | The number of lines to display (default=25) |
See Also | |
ConcordanceIndex |
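To make the width and lines parameters concrete, here is a rough keyword-in-context sketch in plain Python. It is not NLTK's ConcordanceIndex: `concordance_lines` is a hypothetical helper, it handles a single target word only (not a phrase), and it rebuilds the context strings for every hit, which is fine for a short token list but slow for a large corpus.

```python
def concordance_lines(tokens, word, width=80, lines=25):
    """Return up to `lines` strings showing `word` centred in its context."""
    # Characters available on each side of the target word.
    half = max((width - len(word) - 2) // 2, 1)
    target = word.lower()
    result = []
    for i, tok in enumerate(tokens):
        if len(result) >= lines:
            break
        # Word matching is not case-sensitive.
        if tok.lower() != target:
            continue
        left = " ".join(tokens[:i])[-half:]
        right = " ".join(tokens[i + 1:])[:half]
        # Right-align the left context so the target words line up.
        result.append(f"{left:>{half}} {tok} {right}".rstrip())
    return result
```

Because the left context is padded to a fixed width, the matched word starts at the same column on every line.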
Generate a concordance for word with the specified context window. Word matching is not case-sensitive.
Parameters | |
word:str or list | The target word or phrase (a list of strings) |
width:int | The width of each line, in characters (default=80) |
lines:int | The number of lines to display (default=25) |
See Also | |
ConcordanceIndex |
Produce a plot showing the distribution of the words through the text. Requires pylab to be installed.
Parameters | |
words:list(str) | The words to be plotted |
See Also | |
nltk.draw.dispersion_plot() |
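A dispersion plot is, in essence, a scatter of token offsets per word. The sketch below computes only that underlying data with the standard library and draws nothing; `word_offsets` and its `ignore_case` flag are hypothetical names used for illustration.

```python
def word_offsets(tokens, words, ignore_case=False):
    """Map each word in `words` to the token offsets where it occurs."""

    def norm(word):
        return word.lower() if ignore_case else word

    offsets = {word: [] for word in words}
    lookup = {norm(word): word for word in words}
    for position, tok in enumerate(tokens):
        key = lookup.get(norm(tok))
        if key is not None:
            offsets[key].append(position)
    return offsets
```

Plotting each offset list on its own row, with the offset on the x-axis, gives the kind of picture described above.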
Find instances of the regular expression in the text. The text is a list of tokens, and a regexp pattern to match a single token must be surrounded by angle brackets. E.g.
>>> print('hack'); from nltk.book import text1, text5, text9
hack...
>>> text5.findall("<.*><.*><bro>")
you rule bro; telling you bro; u twizted bro
>>> text1.findall("<a>(<.*>)<man>")
monied; nervous; dangerous; white; white; white; pious; queer; good; mature; white; Cape; great; wise; wise; butterless; white; fiendish; pale; furious; better; certain; complete; dismasted; younger; brave; brave; brave; brave
>>> text9.findall("<th.*>{3,}")
thread through those; the thought that; that the thing; the thing that; that that thing; through these than through; them that the; through the thick; them that they; thought that the
Parameters | |
regexp:str | A regular expression |
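One way to read the angle-bracket notation is as an ordinary regular expression applied to a string in which every token is wrapped in angle brackets. The following standard-library sketch illustrates that reading. `find_token_patterns` is a hypothetical helper, not NLTK's implementation, and it assumes that tokens contain no angle brackets and that the pattern has at most one capturing group.

```python
import re


def find_token_patterns(tokens, pattern):
    """Return each match of a token-level pattern as a list of tokens."""
    # Encode the token list as one string: ["a", "man"] -> "<a><man>".
    raw = "".join("<" + tok + ">" for tok in tokens)
    # Whitespace in the pattern is not significant.
    pattern = re.sub(r"\s", "", pattern)
    # Wrap each <...> unit in non-capturing groups so that quantifiers
    # such as {3,} apply to whole tokens.
    pattern = pattern.replace("<", "(?:<(?:").replace(">", ")>)")
    # An unescaped "." must not match across a token boundary.
    pattern = re.sub(r"(?<!\\)\.", "[^>]", pattern)
    hits = re.findall(pattern, raw)
    # Strip the outer brackets and split multi-token matches.
    return [hit[1:-1].split("><") for hit in hits]
```

With a capturing group in the pattern, only the captured tokens are returned, which matches the behaviour of the `<a>(<.*>)<man>` example above.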
Print random text, generated using a trigram language model. See also help(nltk.lm).
Parameters | |
length:int | The length of text to generate (default=100) |
text_seed | Generation can be conditioned on preceding context. |
random_seed:int | A random seed or an instance of random.Random. If provided, makes the random sampling part of generation reproducible. (default=42) |
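The role of the seed parameters can be shown with a small trigram generator written against the standard library only. This is a sketch of the general technique, not the nltk.lm model; `train_trigrams` and `generate_tokens` are hypothetical names, and the parameter names `text_seed` and `random_seed` are chosen here for illustration. It assumes the training text has at least three tokens.

```python
import random
from collections import defaultdict


def train_trigrams(tokens):
    """Map each pair of adjacent tokens to the tokens observed after it."""
    model = defaultdict(list)
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        model[(w1, w2)].append(w3)
    return model


def generate_tokens(model, length=100, text_seed=None, random_seed=42):
    """Sample up to `length` tokens; the same seed gives the same output."""
    if isinstance(random_seed, random.Random):
        rng = random_seed
    else:
        rng = random.Random(random_seed)
    if not model:
        return list(text_seed or [])
    if text_seed and len(text_seed) >= 2:
        # Condition generation on the preceding context.
        output = list(text_seed)
    else:
        # Shorter seeds are ignored; start from a randomly chosen pair.
        output = list(rng.choice(sorted(model)))
    while len(output) < length:
        followers = model.get(tuple(output[-2:]))
        if not followers:
            break  # Unseen context: stop early.
        output.append(rng.choice(followers))
    return output
```

Passing the same integer seed twice creates two identically seeded generators, which is why the sampled text is reproducible.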
Distributional similarity: find other words which appear in the same contexts as the specified word; list most similar words first.
Parameters | |
word:str | The word used to seed the similarity search |
num:int | The number of words to generate (default=20) |
See Also | |
ContextIndex.similar_words() |
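Distributional similarity and common contexts both rest on the same structure: an index from each word to the (left, right) contexts in which it occurs. The sketch below builds such an index in plain Python and scores similarity by the number of shared contexts. It is an illustration under stated assumptions rather than the ContextIndex implementation: all function names are hypothetical, the `*START*` and `*END*` boundary markers are assumptions, and punctuation is not skipped.

```python
from collections import Counter, defaultdict


def context(tokens, i):
    """One left and one right token, both lower-cased."""
    left = tokens[i - 1].lower() if i > 0 else "*START*"
    right = tokens[i + 1].lower() if i < len(tokens) - 1 else "*END*"
    return (left, right)


def build_context_index(tokens):
    """Map each lower-cased word to a Counter of its contexts."""
    index = defaultdict(Counter)
    for i, tok in enumerate(tokens):
        index[tok.lower()][context(tokens, i)] += 1
    return index


def similar_words(index, word, num=20):
    """Words sharing contexts with `word`, most shared contexts first."""
    word = word.lower()
    target = set(index.get(word, ()))
    scores = {}
    for other, contexts in index.items():
        shared = target & set(contexts)
        if other != word and shared:
            scores[other] = len(shared)
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [w for w, _ in ranked[:num]]


def common_contexts(index, words, num=20):
    """Contexts in which every one of `words` occurs, most frequent first."""
    words = [w.lower() for w in words]
    if not words:
        return []
    shared = set.intersection(*(set(index.get(w, ())) for w in words))
    counts = Counter()
    for w in words:
        for ctx in shared:
            counts[ctx] += index[w][ctx]
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [ctx for ctx, _ in ranked[:num]]
```

For example, in a text containing both "the big dog" and "the small dog", the words "big" and "small" share the context ("the", "dog") and are therefore reported as similar.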