class documentation

class Text(object): (source)

Known subclasses: nltk.text.TextCollection

Constructor: Text(tokens, name)


A wrapper around a sequence of simple (string) tokens, which is intended to support initial exploration of texts (via the interactive console). Its methods perform a variety of analyses on the text's contexts (e.g., counting, concordancing, collocation discovery), and display the results. If you wish to write a program which makes use of these analyses, then you should bypass the Text class, and use the appropriate analysis function or class directly instead.
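For example, the concordance analysis is also available directly through nltk.text.ConcordanceIndex. A minimal sketch with an inline token list (not from the source):

```python
from nltk.text import ConcordanceIndex

# Build the index directly from a token sequence, without a Text wrapper.
tokens = "to be or not to be".split()
index = ConcordanceIndex(tokens)

# offsets() returns the positions at which the word occurs.
print(index.offsets("be"))  # [1, 5]
```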

A Text is typically initialized from a given document or corpus. E.g.:

>>> import nltk.corpus
>>> from nltk.text import Text
>>> moby = Text(nltk.corpus.gutenberg.words('melville-moby_dick.txt'))
Method __getitem__ Undocumented
Method __init__ Create a Text object.
Method __len__ Undocumented
Method __repr__ Undocumented
Method __str__ Undocumented
Method collocation_list Return collocations derived from the text, ignoring stopwords.
Method collocations Print collocations derived from the text, ignoring stopwords.
Method common_contexts Find contexts where the specified words appear; list most frequent common contexts first.
Method concordance Prints a concordance for word with the specified context window. Word matching is not case-sensitive.
Method concordance_list Generate a concordance for word with the specified context window. Word matching is not case-sensitive.
Method count Count the number of times this word appears in the text.
Method dispersion_plot Produce a plot showing the distribution of the words through the text. Requires pylab to be installed.
Method findall Find instances of the regular expression in the text. The text is a list of tokens, and a regexp pattern to match a single token must be surrounded by angle brackets. E.g.
Method generate Print random text, generated using a trigram language model. See also help(nltk.lm).
Method index Find the index of the first occurrence of the word in the text.
Method plot See documentation for FreqDist.plot() :seealso: nltk.prob.FreqDist.plot()
Method readability Undocumented
Method similar Distributional similarity: find other words which appear in the same contexts as the specified word; list most similar words first.
Method vocab No summary
Instance Variable name Undocumented
Instance Variable tokens Undocumented
Method _context One left & one right token, both case-normalized. Skip over non-sentence-final punctuation. Used by the ContextIndex that is created for similar() and common_contexts().
Method _train_default_ngram_lm Undocumented
Constant _CONTEXT_RE Undocumented
Constant _COPY_TOKENS Undocumented
Instance Variable _collocations Undocumented
Instance Variable _concordance_index Undocumented
Instance Variable _num Undocumented
Instance Variable _token_searcher Undocumented
Instance Variable _tokenized_sents Undocumented
Instance Variable _trigram_model Undocumented
Instance Variable _vocab Undocumented
Instance Variable _window_size Undocumented
Instance Variable _word_context_index Undocumented
def __getitem__(self, i): (source)

Undocumented

def __init__(self, tokens, name=None): (source)

Create a Text object.

Parameters
tokens (sequence of str): The source text.
name: Undocumented
def __len__(self): (source)

Undocumented

def __repr__(self): (source)

Undocumented

def __str__(self): (source)

Undocumented

def collocation_list(self, num=20, window_size=2): (source)

Return collocations derived from the text, ignoring stopwords.

>>> from nltk.book import text4
>>> text4.collocation_list()[:2]
[('United', 'States'), ('fellow', 'citizens')]
Parameters
num (int): The maximum number of collocations to return.
window_size (int): The number of tokens spanned by a collocation (default=2).
Returns
list(tuple(str, str)): Undocumented
def collocations(self, num=20, window_size=2): (source)

Print collocations derived from the text, ignoring stopwords.

>>> from nltk.book import text4
>>> text4.collocations() # doctest: +ELLIPSIS
United States; fellow citizens; four years; ...
Parameters
num (int): The maximum number of collocations to print.
window_size (int): The number of tokens spanned by a collocation (default=2).
def common_contexts(self, words, num=20): (source)

Find contexts where the specified words appear; list most frequent common contexts first.

Parameters
words (list(str)): The words used to seed the similarity search.
num (int): The maximum number of common contexts to display (default=20).
See Also
ContextIndex.common_contexts()
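A minimal sketch with a toy token list (not from the source): here "b" and "d" both occur between "a" and "c", and each shared context is printed in the form left_right:

```python
from nltk.text import Text

# Toy text in which "b" and "d" share the context ("a", "c").
text = Text("a b c a d c".split())

# Prints the contexts common to the given words, e.g. "a_c".
text.common_contexts(["b", "d"])
```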
def concordance(self, word, width=79, lines=25): (source)

Prints a concordance for word with the specified context window. Word matching is not case-sensitive.

Parameters
word (str or list): The target word or phrase (a list of strings).
width (int): The width of each line, in characters (default=79).
lines (int): The number of lines to display (default=25).
See Also
ConcordanceIndex
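concordance() prints its results rather than returning them. A minimal sketch with an inline token list (not from the source); the exact header text may vary by NLTK version, but it is typically of the form "Displaying 2 of 2 matches:":

```python
from nltk.text import Text

text = Text("to be or not to be".split())

# Prints each occurrence of "be" centered in a window of context.
text.concordance("be", width=40)
```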
def concordance_list(self, word, width=79, lines=25): (source)

Generate a concordance for word with the specified context window. Word matching is not case-sensitive.

Parameters
word (str or list): The target word or phrase (a list of strings).
width (int): The width of each line, in characters (default=79).
lines (int): The maximum number of lines to return (default=25).
See Also
ConcordanceIndex
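Unlike concordance(), concordance_list() returns the matches instead of printing them. In recent NLTK versions each element is a ConcordanceLine named tuple; the fields used below (query, offset) are assumed from that API:

```python
from nltk.text import Text

text = Text("to be or not to be".split())
lines = text.concordance_list("be")

print(len(lines))        # one entry per match
print(lines[0].query)    # the matched token
print(lines[0].offset)   # its position in the text
```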
def count(self, word): (source)

Count the number of times this word appears in the text.
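For example, with an inline token list (not from the source):

```python
from nltk.text import Text

text = Text("to be or not to be".split())
print(text.count("to"))   # 2
print(text.count("not"))  # 1
```

Note that the count is over exact token strings, so matching is case-sensitive.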

def dispersion_plot(self, words): (source)

Produce a plot showing the distribution of the words through the text. Requires pylab to be installed.

Parameters
words (list(str)): The words to be plotted.
See Also
nltk.draw.dispersion_plot()
def findall(self, regexp): (source)

Find instances of the regular expression in the text. The text is a list of tokens, and a regexp pattern to match a single token must be surrounded by angle brackets. E.g.

>>> print('hack'); from nltk.book import text1, text5, text9
hack...
>>> text5.findall("<.*><.*><bro>")
you rule bro; telling you bro; u twizted bro
>>> text1.findall("<a>(<.*>)<man>")
monied; nervous; dangerous; white; white; white; pious; queer; good;
mature; white; Cape; great; wise; wise; butterless; white; fiendish;
pale; furious; better; certain; complete; dismasted; younger; brave;
brave; brave; brave
>>> text9.findall("<th.*>{3,}")
thread through those; the thought that; that the thing; the thing
that; that that thing; through these than through; them that the;
through the thick; them that they; thought that the
Parameters
regexp (str): A regular expression.
def generate(self, length=100, text_seed=None, random_seed=42): (source)

Print random text, generated using a trigram language model. See also help(nltk.lm).

Parameters
length (int): The length of text to generate (default=100).
text_seed (list(str)): Generation can be conditioned on preceding context.
random_seed (int): A random seed or an instance of random.Random. If provided, makes the random sampling part of generation reproducible (default=42).
def index(self, word): (source)

Find the index of the first occurrence of the word in the text.
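For example, with an inline token list (not from the source):

```python
from nltk.text import Text

text = Text("to be or not to be".split())
print(text.index("not"))  # 3
```

As with list.index(), a word that does not occur in the text raises ValueError.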

def plot(self, *args): (source)

See documentation for FreqDist.plot().

See Also
nltk.prob.FreqDist.plot()

def readability(self, method): (source)

Undocumented

def similar(self, word, num=20): (source)

Distributional similarity: find other words which appear in the same contexts as the specified word; list most similar words first.

Parameters
word (str): The word used to seed the similarity search.
num (int): The number of words to generate (default=20).
See Also
ContextIndex.similar_words()
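A toy sketch (not from the source): in the text below, "b" and "d" both occur in the context ("a", "c"), so similar("b") should list "d":

```python
from nltk.text import Text

text = Text("a b c a d c".split())

# "d" appears in the same (left, right) context as "b", so it is printed.
text.similar("b")
```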
def vocab(self): (source)

See Also
nltk.prob.FreqDist

name = (source)

Undocumented

tokens = (source)

Undocumented
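The vocab() method above returns a FreqDist over the text's tokens, so the usual FreqDist API (indexing, most_common(), N()) applies; a minimal sketch with an inline token list:

```python
from nltk.text import Text

text = Text("to be or not to be".split())
fd = text.vocab()

print(fd["to"])           # frequency of "to" in the text
print(fd.most_common(2))  # the two most frequent tokens with counts
```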

def _context(self, tokens, i): (source)

One left & one right token, both case-normalized. Skip over non-sentence-final punctuation. Used by the ContextIndex that is created for similar() and common_contexts().

def _train_default_ngram_lm(self, tokenized_sents, n=3): (source)

Undocumented

_CONTEXT_RE = (source)

Undocumented

Value
re.compile(r'\w+|[\.!\?]')
_COPY_TOKENS: bool = (source)

Undocumented

Value
True
_collocations = (source)

Undocumented

_concordance_index = (source)

Undocumented

_num = (source)

Undocumented

_token_searcher = (source)

Undocumented

_tokenized_sents = (source)

Undocumented

_trigram_model = (source)

Undocumented

_vocab = (source)

Undocumented

_window_size = (source)

Undocumented

_word_context_index = (source)

Undocumented