Undocumented
Class |
|
A 'view' of a corpus file that joins together one or more StreamBackedCorpusView objects. At most one file handle is left open at any time. |
Class |
|
A stream-backed corpus view for corpus files that consist of sequences of serialized Python objects (serialized using pickle.dump). One use case for this class is to store the result of running feature detection on a corpus to disk... |
Class |
|
A 'view' of a corpus file, which acts like a sequence of tokens: it can be accessed by index, iterated over, etc. However, the tokens are only constructed as needed, so the entire corpus is never stored in memory at once. |
Function | concat |
Concatenate together the contents of multiple documents from a single corpus, using an appropriate concatenation function. This utility function is used by corpus readers when the user requests more than one document at a time. |
Function | read |
Undocumented |
Function | read |
Undocumented |
Function | read |
Undocumented |
Function | read |
Read a sequence of tokens from a stream, where tokens begin with lines that match start_re. If end_re is specified, then tokens end with lines that match end_re; otherwise, tokens end whenever the next line matching ... |
Function | read |
Read a sequence of s-expressions from the stream, and leave the stream's file position at the end of the last complete s-expression read. This function will always return at least one s-expression, unless there are no more s-expressions in the file. |
Function | read |
Undocumented |
Function | read |
Undocumented |
Function | _parse |
Undocumented |
Function | _path |
Undocumented |
Function | _sub |
Helper function: given a regexp match, return a string of spaces that's the same length as the matched string. |
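The length-preserving replacement described for the last helper in the table can be sketched as follows. This is an illustrative stand-in, not the library's own code: the name `blank_out` and the sample text are assumptions made for the example. Replacing a match with spaces of equal length keeps the character offsets of the surrounding text unchanged.

```python
import re


def blank_out(match):
    """Hypothetical helper: return spaces as long as the matched string."""
    return " " * (match.end() - match.start())


# Assumed sample input: mask markup tags without shifting character offsets.
text = "See <b>this</b> word"
masked = re.sub(r"<[^>]+>", blank_out, text)
```

Because every match is swapped for a run of spaces of the same length, an offset computed on `masked` points at the same character in `text`.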
Concatenate together the contents of multiple documents from a single corpus, using an appropriate concatenation function. This utility function is used by corpus readers when the user requests more than one document at a time.
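One way to picture "an appropriate concatenation function" is to dispatch on the type of the per-document results. The sketch below is a simplified assumption, not the library's implementation: `concat_docs` is a hypothetical name, and it only handles strings, lists, and tuples.

```python
from itertools import chain


def concat_docs(docs):
    """Hypothetical stand-in: join per-document contents according to their type."""
    docs = list(docs)
    if not docs:
        raise ValueError("concat_docs() expects at least one document")
    if len(docs) == 1:
        # A single document needs no concatenation.
        return docs[0]
    if all(isinstance(d, str) for d in docs):
        return "".join(docs)
    if all(isinstance(d, list) for d in docs):
        return list(chain.from_iterable(docs))
    if all(isinstance(d, tuple) for d in docs):
        return tuple(chain.from_iterable(docs))
    raise ValueError("don't know how to concatenate these document types")
```

For example, `concat_docs([["a"], ["b", "c"]])` flattens two token lists into one, while `concat_docs(["ab", "cd"])` joins raw strings.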
Read a sequence of tokens from a stream, where tokens begin with lines that match start_re. If end_re is specified, then tokens end with lines that match end_re; otherwise, tokens end whenever the next line matching start_re or EOF is found.
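The delimiting rule above can be sketched over a list of lines. This is a simplified illustration under stated assumptions: `split_blocks` is a hypothetical name, it works on lines already in memory rather than on a stream with a file position, and it assumes the line matching end_re is not included in the token.

```python
import re


def split_blocks(lines, start_re, end_re=None):
    """Hypothetical helper: group lines into tokens delimited by regexps."""
    tokens, current = [], None
    for line in lines:
        if current is None:
            # Outside a token: skip lines until one matches start_re.
            if re.match(start_re, line):
                current = [line]
            continue
        if end_re is not None and re.match(end_re, line):
            # An explicit end line closes the token (assumed not included).
            tokens.append("".join(current))
            current = None
        elif end_re is None and re.match(start_re, line):
            # Without end_re, the next start line closes the token and opens a new one.
            tokens.append("".join(current))
            current = [line]
        else:
            current.append(line)
    if current is not None:
        # End of input closes the last open token.
        tokens.append("".join(current))
    return tokens
```

With only start_re given, each token runs up to the next start line; with end_re given, lines between an end line and the next start line are skipped.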
Read a sequence of s-expressions from the stream, and leave the stream's file position at the end of the last complete s-expression read. This function will always return at least one s-expression, unless there are no more s-expressions in the file.
If the file ends in the middle of an s-expression, then that incomplete s-expression is returned when the end of the file is reached.
Parameters | |
stream | Undocumented |
block | The default block size for reading. If an s-expression is longer than one block, then more than one block will be read. |
comment | A character that marks comments. Any lines that begin with this character will be stripped out. (If spaces or tabs precede the comment character, then the line will not be stripped.) |
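The behaviour described above (comment stripping, complete s-expressions, and a trailing incomplete one at end of file) can be sketched on an in-memory string. This is an assumption-laden illustration, not the documented function: `split_sexprs` is a hypothetical name, it ignores block sizes and file positions, and it only recognises parenthesised expressions.

```python
def split_sexprs(text, comment=None):
    """Hypothetical helper: split text into parenthesised s-expressions."""
    if comment is not None:
        # Drop lines that begin with the comment character. Lines where
        # spaces or tabs precede it are kept, as the parameter note says.
        text = "".join(
            line
            for line in text.splitlines(keepends=True)
            if not line.startswith(comment)
        )
    sexprs, depth, start = [], 0, None
    for i, ch in enumerate(text):
        if ch == "(":
            if depth == 0:
                start = i  # a new top-level s-expression begins here
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
            if depth == 0:
                sexprs.append(text[start:i + 1])
                start = None
    if start is not None:
        # Input ended mid-expression: return the incomplete remainder too.
        sexprs.append(text[start:].strip())
    return sexprs
```

Tracking only the nesting depth is enough to find top-level boundaries, which is why an expression longer than one read block would simply require reading more input before the depth returns to zero.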