module documentation

Undocumented

Class ConcatenatedCorpusView A 'view' of a corpus file that joins together one or more StreamBackedCorpusViews. At most one file handle is left open at any time.
Class PickleCorpusView A stream backed corpus view for corpus files that consist of sequences of serialized Python objects (serialized using pickle.dump). One use case for this class is to store the result of running feature detection on a corpus to disk...
Class StreamBackedCorpusView A 'view' of a corpus file, which acts like a sequence of tokens: it can be accessed by index, iterated over, etc. However, the tokens are only constructed as-needed -- the entire corpus is never stored in memory at once.
Function concat Concatenate together the contents of multiple documents from a single corpus, using an appropriate concatenation function. This utility function is used by corpus readers when the user requests more than one document at a time.
Function read_alignedsent_block Undocumented
Function read_blankline_block Undocumented
Function read_line_block Undocumented
Function read_regexp_block Read a sequence of tokens from a stream, where tokens begin with lines that match start_re. If end_re is specified, then tokens end with lines that match end_re; otherwise, tokens end whenever the next line matching ...
Function read_sexpr_block Read a sequence of s-expressions from the stream, and leave the stream's file position at the end of the last complete s-expression read. This function will always return at least one s-expression, unless there are no more s-expressions in the file.
Function read_whitespace_block Undocumented
Function read_wordpunct_block Undocumented
Function _parse_sexpr_block Undocumented
Function _path_from Undocumented
Function _sub_space Helper function: given a regexp match, return a string of spaces that's the same length as the matched string.
def concat(docs):

Concatenate together the contents of multiple documents from a single corpus, using an appropriate concatenation function. This utility function is used by corpus readers when the user requests more than one document at a time.
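
To make "an appropriate concatenation function" more concrete, the following is a minimal sketch that picks a joining strategy from the type of the documents. The name concat_sketch is hypothetical, and the handled types (strings, lists, tuples) are an assumption for illustration; the library function may well support further types, such as corpus views.

```python
from functools import reduce
import operator


def concat_sketch(docs):
    """Hypothetical stand-in for concat(): join documents according to their type."""
    docs = list(docs)
    if not docs:
        raise ValueError("concat_sketch() expects at least one document")
    if len(docs) == 1:
        # A single document is returned unchanged.
        return docs[0]

    types = {type(d) for d in docs}
    if types == {str}:
        # Raw text: join into one string.
        return "".join(docs)
    if types == {list} or types == {tuple}:
        # Token sequences of one type: chain them with +.
        return reduce(operator.add, docs)
    raise ValueError("Don't know how to concatenate types: %r" % types)


words = concat_sketch([["The", "cat"], ["sat", "."]])
text = concat_sketch(["first doc\n", "second doc\n"])
```

Dispatching on type keeps the caller's code the same whether it asked for raw text or for token lists.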

def read_alignedsent_block(stream):

Undocumented

def read_blankline_block(stream):

Undocumented

def read_line_block(stream):

Undocumented

def read_regexp_block(stream, start_re, end_re=None):

Read a sequence of tokens from a stream, where tokens begin with lines that match start_re. If end_re is specified, then tokens end with lines that match end_re; otherwise, tokens end whenever the next line matching start_re or EOF is found.
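
The behavior described above can be sketched as follows. This is a self-contained approximation, not the library's code: the name read_regexp_block_sketch is hypothetical, and two details are assumptions of the sketch, namely that each call returns a list holding at most one token and that a line matching end_re is consumed but not included in the token.

```python
import io
import re


def read_regexp_block_sketch(stream, start_re, end_re=None):
    """Hypothetical re-implementation: return a list with at most one token."""
    # Skip lines until one matches start_re.
    while True:
        line = stream.readline()
        if not line:
            return []  # EOF reached before any token started
        if re.match(start_re, line):
            break

    lines = [line]
    while True:
        pos = stream.tell()
        line = stream.readline()
        if not line:
            break  # EOF ends the token
        if end_re is not None:
            if re.match(end_re, line):
                break  # end line closes the token (not included in this sketch)
        elif re.match(start_re, line):
            stream.seek(pos)  # leave the next token's first line unread
            break
        lines.append(line)
    return ["".join(lines)]


stream = io.StringIO("# one\na\nb\n# two\nc\n")
first = read_regexp_block_sketch(stream, r"#")
second = read_regexp_block_sketch(stream, r"#")
third = read_regexp_block_sketch(stream, r"#")
```

Seeking back to the saved position is what lets the next call start on the line that begins the following token.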

def read_sexpr_block(stream, block_size=16384, comment_char=None):

Read a sequence of s-expressions from the stream, and leave the stream's file position at the end of the last complete s-expression read. This function will always return at least one s-expression, unless there are no more s-expressions in the file.

If the file ends in the middle of an s-expression, then that incomplete s-expression is returned when the end of the file is reached.

Parameters
stream: Undocumented
block_size: The default block size for reading. If an s-expression is longer than one block, then more than one block will be read.
comment_char: A character that marks comments. Any lines that begin with this character will be stripped out. (If spaces or tabs precede the comment character, then the line will not be stripped.)
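
The idea of returning only complete s-expressions and remembering where the last one ended can be illustrated with a small parenthesis-depth scan. This is a simplified sketch under stated assumptions: split_sexprs is a hypothetical helper, it works on an in-memory string rather than a stream, it only recognizes parenthesized expressions, and it ignores block_size and comment_char.

```python
def split_sexprs(block):
    """Hypothetical helper: split text into complete top-level s-expressions.

    Returns (tokens, end), where end is the offset just past the last
    complete s-expression; text after that offset is left for a later read.
    """
    tokens = []
    depth = 0
    start = None
    end = 0
    for i, ch in enumerate(block):
        if ch == "(":
            if depth == 0:
                start = i  # a new top-level expression begins here
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
            if depth == 0:
                # The top-level expression is complete.
                tokens.append(block[start:i + 1])
                end = i + 1
    return tokens, end


data = "(a (b c)) (d) (e (f"
tokens, end = split_sexprs(data)
remainder = data[end:]
```

Here the trailing "(e (f" is incomplete, so it stays in the remainder; a stream-based reader would read another block and try again, or return it as-is at end of file.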
def read_whitespace_block(stream):

Undocumented

def read_wordpunct_block(stream):

Undocumented

def _parse_sexpr_block(block):

Undocumented

def _path_from(parent, child):

Undocumented

def _sub_space(m):

Helper function: given a regexp match, return a string of spaces that's the same length as the matched string.
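
A helper of this kind is typically passed to re.sub as the replacement callback. The sketch below re-implements it under the hypothetical name sub_space; using it to blank out ";" comments is an assumed use case chosen for illustration, not something the source states.

```python
import re


def sub_space(m):
    """Return a run of spaces exactly as long as the text matched by m."""
    return " " * (m.end() - m.start())


text = "(a b) ; note\n(c)"
# Replace each comment with spaces so every other character keeps its offset.
blanked = re.sub(r";[^\n]*", sub_space, text)
```

Because the replacement has the same length as the match, character offsets computed on the blanked text remain valid for the original text.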