nltk.text.TextCollection

class documentation

class TextCollection(Text):

Constructor: TextCollection(source)

A collection of texts, which can be loaded with a list of texts or with a corpus consisting of one or more texts, and which supports counting, concordancing, collocation discovery, etc. Initialize a TextCollection as follows:

>>> import nltk.corpus
>>> from nltk.text import TextCollection
>>> print('hack'); from nltk.book import text1, text2, text3
hack...
>>> gutenberg = TextCollection(nltk.corpus.gutenberg)
>>> mytexts = TextCollection([text1, text2, text3])

Iterating over a TextCollection produces all the tokens of all the texts in order.
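
To illustrate the iteration order, here is a minimal plain-Python sketch. It assumes each text is simply a list of tokens; the toy `texts` are made up, and `itertools.chain` only stands in for the collection's behavior rather than showing NLTK's internals.

```python
from itertools import chain

# Hypothetical toy texts, each already tokenized.
texts = [
    ["call", "me", "ishmael"],
    ["in", "the", "beginning"],
]

# Tokens come out text by text, preserving the order within each text.
all_tokens = list(chain.from_iterable(texts))
print(all_tokens)
```

With NLTK installed, `list(TextCollection(texts))` should yield the same sequence of tokens.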

Method	`__init__`	Create a Text object.
Method	`idf`	The natural logarithm of the number of texts in the corpus divided by the number of texts that the term appears in. If a term does not appear in the corpus, 0.0 is returned.
Method	`tf`	The relative frequency of the term in text: the number of occurrences of the term divided by the length of the text.
Method	`tf_idf`	The product of `tf(term, text)` and `idf(term)`.
Instance Variable	`_idf_cache`	Undocumented
Instance Variable	`_texts`	Undocumented

Inherited from Text:

Method	`__getitem__`	Undocumented
Method	`__len__`	Undocumented
Method	`__repr__`	Undocumented
Method	`__str__`	Undocumented
Method	`collocation_list`	Return collocations derived from the text, ignoring stopwords.
Method	`collocations`	Print collocations derived from the text, ignoring stopwords.
Method	`common_contexts`	Find contexts where the specified words appear; list most frequent common contexts first.
Method	`concordance`	Prints a concordance for `word` with the specified context window. Word matching is not case-sensitive.
Method	`concordance_list`	Generate a concordance for `word` with the specified context window. Word matching is not case-sensitive.
Method	`count`	Count the number of times this word appears in the text.
Method	`dispersion_plot`	Produce a plot showing the distribution of the words through the text. Requires pylab to be installed.
Method	`findall`	Find instances of the regular expression in the text. The text is a list of tokens, and a regexp pattern to match a single token must be surrounded by angle brackets.
Method	`generate`	Print random text, generated using a trigram language model. See also `help(nltk.lm)`.
Method	`index`	Find the index of the first occurrence of the word in the text.
Method	`plot`	See the documentation for FreqDist.plot() (see also nltk.prob.FreqDist.plot()).
Method	`readability`	Undocumented
Method	`similar`	Distributional similarity: find other words which appear in the same contexts as the specified word; list most similar words first.
Method	`vocab`	No summary
Instance Variable	`name`	Undocumented
Instance Variable	`tokens`	Undocumented
Method	`_context`	One left & one right token, both case-normalized. Skip over non-sentence-final punctuation. Used by the `ContextIndex` that is created for `similar()` and `common_contexts()`.
Method	`_train_default_ngram_lm`	Undocumented
Constant	`_CONTEXT_RE`	Undocumented
Constant	`_COPY_TOKENS`	Undocumented
Instance Variable	`_collocations`	Undocumented
Instance Variable	`_concordance_index`	Undocumented
Instance Variable	`_num`	Undocumented
Instance Variable	`_token_searcher`	Undocumented
Instance Variable	`_tokenized_sents`	Undocumented
Instance Variable	`_trigram_model`	Undocumented
Instance Variable	`_vocab`	Undocumented
Instance Variable	`_window_size`	Undocumented
Instance Variable	`_word_context_index`	Undocumented

def __init__(self, source):

overrides nltk.text.Text.__init__

Create a Text object.

Parameters
source	A list of texts, or a corpus consisting of one or more texts.
tokens:sequence of str	The source text.

def idf(self, term):

The natural logarithm of the number of texts in the corpus divided by the number of texts that the term appears in. If a term does not appear in the corpus, 0.0 is returned.
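
The following plain-Python sketch shows one way to compute an inverse document frequency of this kind. It assumes the ratio is log-scaled, as in the usual tf-idf definition, and that each text is a list of tokens. The helper name `inverse_document_frequency`, the optional `cache` argument, and the toy data are hypothetical and are not part of NLTK.

```python
from math import log

def inverse_document_frequency(term, texts, cache=None):
    """log(N / n_t): N texts in total, n_t of them containing `term`."""
    if cache is not None and term in cache:
        return cache[term]
    if not texts:
        raise ValueError("IDF is undefined for an empty collection")
    # Count the texts in which the term occurs at least once.
    matches = sum(1 for text in texts if term in text)
    # Assumption of this sketch: absent terms score 0.0, as the text above states.
    value = log(len(texts) / matches) if matches else 0.0
    if cache is not None:
        cache[term] = value
    return value

texts = [["a", "b"], ["a", "c"], ["a", "d"], ["e"]]
cache = {}
print(inverse_document_frequency("a", texts, cache))  # term in 3 of 4 texts
print(inverse_document_frequency("e", texts, cache))  # term in 1 of 4 texts
print(inverse_document_frequency("z", texts, cache))  # absent term
```

Under this definition a term that occurs in every text also scores 0.0, because log(1) = 0, so rare terms are the ones that receive high weights.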

def tf(self, term, text):

The relative frequency of the term in text: the number of occurrences of the term divided by the length of the text.
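
As a rough illustration, term frequency can be sketched in plain Python as a relative count. The function name `term_frequency` and the sample tokens are hypothetical, and the sketch assumes `text` is a non-empty list of tokens.

```python
def term_frequency(term, text):
    """Occurrences of `term` divided by the number of tokens in `text`."""
    return text.count(term) / len(text)

text = ["the", "cat", "sat", "on", "the", "mat"]
print(term_frequency("the", text))  # 2 occurrences out of 6 tokens
print(term_frequency("dog", text))  # term does not occur in the text
```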

def tf_idf(self, term, text):

The product of tf(term, text) and idf(term).
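
A small worked example may help here. The sketch below assumes the conventional definition, tf multiplied by a log-scaled idf, and uses it to rank the terms of one toy document. All helper functions and data are hypothetical stand-ins written in plain Python. With NLTK installed, the analogous call would be `TextCollection(texts).tf_idf(term, text)`.

```python
from math import log

def tf(term, text):
    # Relative frequency of the term within one text.
    return text.count(term) / len(text)

def idf(term, texts):
    # Log-scaled ratio of all texts to texts containing the term.
    matches = sum(1 for t in texts if term in t)
    return log(len(texts) / matches) if matches else 0.0

def tf_idf(term, text, texts):
    return tf(term, text) * idf(term, texts)

texts = [
    ["the", "whale", "swims"],
    ["the", "ship", "sails"],
    ["the", "whale", "and", "the", "ship"],
]
doc = texts[0]

# Score every distinct term of the first text and rank by weight.
scores = {term: tf_idf(term, doc, texts) for term in set(doc)}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

In this toy collection "the" occurs in every text, so its weight is zero, while "swims" occurs in only one text and therefore ranks first.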

_idf_cache: dict

Undocumented

_texts

Undocumented